acer

32348 Reputation

29 Badges

19 years, 330 days
Ontario, Canada

Social Networks and Content at Maplesoft.com

MaplePrimes Activity


These are replies submitted by acer

sort[inplace] isn't documented, and it is buggy for float[8] rtables when used with custom sorting procedures (except perhaps for `>` and its relatives).

v := Vector[row]([-3,-7,1,9],datatype=float[8]);
                           v := [-3., -7., 1., 9.]

cv1:=copy(v):
cv2:=copy(v):

p:=proc(a,b)
   if signum(a)<>signum(b) then
      a<b;
   else
      abs(a)<abs(b);
   end if;
end proc:

sort(v); # ok
                             [-7., -3., 1., 9.]

sort(cv1,p); # ok
                             [-3., -7., 1., 9.]

sort[inplace](cv1,p); # oops, should be same as last
                             [-7., -3., 1., 9.]

cv1;
                             [-7., -3., 1., 9.]

sort[inplace](cv2,`>`); # ok
                             [9., 1., -3., -7.]

cv2;
                             [9., 1., -3., -7.]

sort(v,`>`); # ok
                             [9., 1., -3., -7.]

sort[inplace](v,(a,b)->a>b); # oops, should be same as last
                             [-7., -3., 1., 9.]

v;
                             [-7., -3., 1., 9.]
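For reference, the ordering that the comparator proc `p` encodes (order across sign groups by value, within a sign group by ascending magnitude) can be cross-checked outside Maple. A Python stand-in, purely as a sanity check of what the custom-sorted result should be:

```python
from functools import cmp_to_key

def p(a, b):
    """Comparator mirroring the Maple proc: if the signs differ,
    order by value; otherwise order by ascending magnitude."""
    def sign(x):
        return (x > 0) - (x < 0)
    if sign(a) != sign(b):
        return -1 if a < b else 1
    if abs(a) < abs(b):
        return -1
    return 1 if abs(a) > abs(b) else 0

v = [-3.0, -7.0, 1.0, 9.0]
print(sorted(v, key=cmp_to_key(p)))  # [-3.0, -7.0, 1.0, 9.0], matching sort(cv1, p)
```

This agrees with the non-inplace sort(cv1,p) result above, confirming that the sort[inplace](cv1,p) output is the wrong one.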


I'll submit some SCRs.


acer

@Axel Vogt 

The following call to display the routine's source (from the internal Maple Library) shows me that ArrayTools:-AddAlongDimension is using NAG's f06ecf for the datatype=float[8] case.

showstat(ArrayTools::AddAlongDimension2D);

And google shows me that NAG's f06ecf is the same as the standard BLAS function daxpy.

But there are two extra arguments in that external call from the `AddAlongDimension2D` routine. I.e.,

    ExtCall(ncols,alpha,y,i-1,incy,x,0,incx)

The two extra arguments are the ones that come right after arguments y and x: the `i-1` and the `0`. These are offsets into the contiguous memory used for the double-precision data of those two Maple float[8] rtables.

Such offsets allow the daxpy routine (which is technically for 1-D arrays of data) to operate along any row or column of a 2-D array of data. Simply offset to the entry that starts the row or column in question, and then stride through the data (incx, or incy) by m, n, or 1, according to whether it is stored in Fortran order or C order (column-major or row-major). For example, when passing a Fortran-order array `x` from C one could (carefully) pass it as daxpy(..., x+3, ...), and so offset the initial address that daxpy sees by that many double widths.
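To make the offset-and-stride mechanics concrete, here is a pure-Python toy daxpy (my own stand-in, not the NAG/BLAS binary) updating one row of a column-major matrix stored flat:

```python
def daxpy(n, alpha, y, oy, incy, x, ox, incx):
    """Toy daxpy with explicit offsets: x[ox+k*incx] += alpha*y[oy+k*incy]
    for k = 0..n-1. Offset picks the starting entry; stride walks the row
    or column through the flat storage."""
    for k in range(n):
        x[ox + k * incx] += alpha * y[oy + k * incy]

# 2x3 matrix in Fortran (column-major) order:
# [[1, 2, 3],
#  [4, 5, 6]]   stored flat as [1, 4, 2, 5, 3, 6]
m, ncols = 2, 3
A = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]

# add 10 * (row 0) onto row 1: row i starts at offset i, stride m
daxpy(ncols, 10.0, A, 0, m, A, 1, m)
print(A)  # [1.0, 14.0, 2.0, 25.0, 3.0, 36.0]
```

The same kernel handles columns by using offset j*m and stride 1 instead.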


My procedure `P` calls the NAG name of the BLAS function daxpy, I believe. And that does this (in pseudo-syntax):

  x -> alpha*y + x

which is an inplace operation on `x`. So `P` uses a 1x1 float[8] Vector for x. And it walks y with a stride of 1, and x with a stride of 0 (!). So x ends up accumulating the sum of all that is in the n-Vector `y`. Something like that.

And `y` is just an Alias of the original 2D Array as a 1D Vector. So the end result is that all the entries in the Array are summed, in one big "add-along" shot.
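The zero-stride trick itself can be shown with the same kind of toy daxpy (again pure Python, not the actual external routine): with incx = 0 every update lands on the single cell x[0], which therefore accumulates the whole sum.

```python
def daxpy(n, alpha, y, oy, incy, x, ox, incx):
    # x[ox+k*incx] += alpha*y[oy+k*incy]; with incx=0 every update
    # hits x[ox], turning daxpy into a one-shot accumulator
    for k in range(n):
        x[ox + k * incx] += alpha * y[oy + k * incy]

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # the 2-D Array aliased as a 1-D Vector
x = [0.0]                            # the 1x1 "Vector" for x
daxpy(len(y), 1.0, y, 0, 1, x, 0, 0)
print(x[0])  # 21.0 — the sum of all entries
```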

You could try writing or finding an inplace sorter, and compiling it (and maybe give its define_external call the THREAD_SAFE option, down the road). It needn't sort inplace on the data rtable (the Aliased Vector, say). It might be easier to find one which merely writes the sorted result out to another temp Vector, and that Vector might be re-usable. (That way is far easier to handle, and is fine for non-threaded parent code. But when parallelizing it becomes tricky again, as each thread needs its own re-usable temp Vector to hold sorted results.) This is the interesting bit. I see where you are at now: you'd like to test a fast version of this, sorting to get the data ready for Kahan or similar careful summation. Fun.
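Since Kahan summation came up: the algorithm itself is short enough to sketch here (Python, as an illustration of compensated summation generally, nothing Maple-specific):

```python
def kahan_sum(data):
    """Kahan compensated summation: carry the low-order bits lost
    at each addition in a running correction term c."""
    s = 0.0
    c = 0.0
    for v in data:
        y = v - c          # apply the stored correction
        t = s + y          # big + small: low-order bits of y may be lost...
        c = (t - s) - y    # ...recover them algebraically
        s = t
    return s

vals = [0.1] * 10
# naive left-to-right sum drifts; kahan_sum is at least as accurate here
print(sum(vals), kahan_sum(vals))
```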

The BLAS on Windows will be MKL. I'm not sure whether their daxpy is threaded, or just cache-tuned. As you can see, 5000 summations of 110x110 float[8] Arrays in under 1 second is not bad. Unfortunately, the Intel people say that one is not advised to call MKL from within a thread of another threaded app. They don't come right out and say that MKL is wholly, or partly, thread-unsafe. I've never tried to force it by placing the option on that daxpy's define_external call from within Maple. On Linux the BLAS is ATLAS, and I've never seen a described limitation on whether it can be called from within a higher-level thread.

You may note that `P` operated on full rectangular storage 2D Arrays (albeit Aliased). Having my two procs be clever about handling triangular data could be a lot of work... for only a factor-of-two benefit. The daxpy is so quick that copying out subsections from below the main diagonal (or making repeated calls with tricky offsets) might well blow away such a small speedup. I'd be tempted to stick with full rectangular storage, thus permitting Maple to do the Alias and a single external daxpy call. I'd likely only suggest divvying up the data if it were very sparse. (In which case we could have even more fun discussing the sparse float[8] structure.)

It'd be interesting to hear how you get along with this task.

nb. I'm not saying that your code is affected by the following, but you might care to see this post on 2D Math runtime efficiency. I saw a wickedly brutal double-whammy of this in the last month or so. Someone had gotten bitten not only by the delayed-multiplication parsing within a proc definition, but had alongside that used the syntax A[i][j] instead of A[i,j] for Matrix entry access. The former creates, as A[i], a whole new temporary row Vector, which is then accessed at its jth entry and immediately disposed of as garbage. The overall penalty was a slowdown of a few hundred times...
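As a toy illustration of the mechanism behind that A[i][j] penalty (a Python stand-in of my own, not Maple internals): two-step indexing manufactures a temporary row object on every access, which direct two-index access avoids.

```python
class Mat:
    """Toy matrix mimicking the behaviour described above: A[i, j]
    reads one entry directly, while A[i] builds a whole temporary
    row object that is immediately thrown away."""
    def __init__(self, rows):
        self.rows = rows
        self.temp_rows_built = 0
    def __getitem__(self, key):
        if isinstance(key, tuple):
            i, j = key
            return self.rows[i][j]    # direct entry access, no temporary
        self.temp_rows_built += 1     # whole-row copy, then garbage
        return list(self.rows[key])

A = Mat([[1, 2], [3, 4]])
x = A[1, 0]   # direct: no temporary created
y = A[1][0]   # builds (and discards) a temp row first
print(x, y, A.temp_rows_built)  # 3 3 1
```

Inside a hot loop, that per-access temporary (plus the garbage it leaves) is exactly the sort of thing that multiplies into a few-hundred-times slowdown.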


acer

This is interesting. I'd like to look at it, but likely cannot work on it for a few days...

acer

@Alejandro Jakubi I hadn't noticed that this was branched from another post. Thanks.


@Alejandro Jakubi That may work for this example. But it also might be entirely beside the point. (Ok, so we don't know for sure what the OP's actual criteria are. But we may guess intelligently. I suggested that it was for u,v,w to appear less often in the result. Nobody's suggested otherwise, or another guess. Who knows, maybe the only criterion is expression length or lack of RootOfs. I was discussing the situation under my own guess.)

Your picking V to exclude merely happens to produce a result in which u, v, w appear less often. But that's just a particular of this example. I see no method at all in your selecting V for exclusion from the trig terms as a `solve` variable, aside from visual inspection or noticing the nature of eqn2, etc.

So your suggestion is much like my own first `solve` suggestion above, rather than what I was attempting with `G`.

Of course, I expect that you, Alejandro, are quite aware of these distinctions. So I'm commenting more for the benefit of the OP.

One could also simply invoke `isolate` on eqn2, in order to get the supposedly "better" formula for sin(b). That's even easier. But I didn't suggest it because it shares the aspect of arbitrariness.

As mentioned, we don't know what the OP really wants. Consider the sin(a) solution from A.J.'s `solve` call. Is it more, less, or equally desirable as this below?

> solve({eqn1, eqn2, eqn3}, {sin(a),cos(a),sin(b)});


       /            u                  w               v\ 
      { cos(a) = --------, sin(a) = --------, sin(b) = - }
       \         V cos(b)           V cos(b)           V/ 

For some other, different example, including V rather than excluding it might be key. It's problem specific. By trying a Groebner basis methodology for handling it as a polynomial system, I was trying to enforce priority amongst the variables (where I made a supposition that u,v,w were undesirable). Maybe it could be done better. Or maybe the OP just wants short solutions. Or the desire might be as simple as just wanting any solutions without RootOfs.


The parsing of the infix &+- operator is also different; it appears to be a difference in precedence.

In 1D Maple notation, without brackets to delimit the 1/2, the input

x &+- 1/2;

produces the output object `&+-`(x,1)/2, while in 2D Math mode it produces the output object `&+-`(x,1/2).

acer

I further cut down the objective function's individual execution time by around another factor of 2, by having the worksheet create lambda, X, NX, beta, g, etc., explicitly as float[8] Vectors instead of implicitly as Maple tables.

I've also run it as an optimization problem, using each of the Optimization, GlobalOptimization, and DirectSearch (ver. 2) packages. I used simple bounds (ranges) on all 5 variables. I mostly let each solver rely on its default options, and didn't try to tweak any of them for speed.

F2proc_newer_stil.mw

I supplied an 'objectivegradient' procedure for Optimization:-NLPSolve to use. (That could be made even simpler, by using a loop or seq over the fdiff calls in that procedure, instead of spelling out all 5 lines explicitly.)
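That loop idea — one central difference per coordinate instead of five explicit lines — might look like this as a Python stand-in for the fdiff calls (`f` and the step size h are my own placeholders):

```python
def fd_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with one central
    difference per coordinate, in a loop rather than 5 explicit lines."""
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h   # bump coordinate i up
        xm = list(x); xm[i] -= h   # bump coordinate i down
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

f = lambda v: v[0]**2 + 3 * v[1]
print(fd_gradient(f, [2.0, 1.0]))  # approximately [4.0, 3.0]
```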

I inserted a few lines into the objective, to allow reporting when a better objective value was found. Mostly for fun, but also partly because I wanted to see its progress rate since speed performance was a prior issue.

acer


The improvements in version 2 are quite apparent. I've really only just begun to investigate this package more deeply. The help pages in version 2 are a big improvement. There's obviously a large amount of development and authoring effort behind this package. Great work.

I have one minor issue, and one (in my opinion) more serious one to report.

When supplying simple bounds as constraints, but without supplying an initial point, the `Search` routine seems to generate its own initial point. But it can sometimes select a point which violates those constraints. Yes, it does emit a Warning for this, but couldn't the routine automatically generate an initial point that satisfies all simple bound constraints (e.g. [x>=-1, x<1, etc])? This may be minor.
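Generating such a point is cheap when all the simple bounds are finite ranges. A hedged sketch (Python, hypothetical helper names) of what a fallback could do:

```python
import random

def feasible_start(bounds, rng=None):
    """Pick a point inside each simple bound [lo, hi], so the
    generated initial point never violates the box constraints."""
    rng = rng or random.Random(0)
    return [lo + rng.random() * (hi - lo) for lo, hi in bounds]

bounds = [(-1.0, 1.0), (0.0, 5.0), (2.0, 3.0)]
x0 = feasible_start(bounds)
print(all(lo <= v <= hi for (lo, hi), v in zip(bounds, x0)))  # True
```

Half-open or unbounded ranges would need a little extra care (e.g. nudging off an excluded endpoint), but the box case is trivial.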

A more serious objection is that the names used in (in)equality constraints, given as expressions, are required to match the names of the formal parameters in the case that the objective is a procedure.

acer
