dohashi

1172 Reputation

10 Badges

16 years, 79 days
I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel; recently, however, I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple and have been trying to teach parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on the design and analysis of algorithms and data structures.

MaplePrimes Activity


These are replies submitted by dohashi

@TomJohnRiddle Hmmm... You said that while you are waiting, CPU utilization is basically zero, so you're not seeing Maple's kernel (the mserver process) running at 100%, correct?  If the kernel were running during that time, I would suspect that some seriously inefficient algorithm was being used in the kernel.  However, if you are not seeing that, and it is not a latency problem caused by pulling each element across the network one at a time, then I'm not sure what else it could be.

What version of MS SQL Server are you using, and what type of data is stored in the table?

Thanks

Darin

There is a difference between calling a method using the standard function call syntax and using :-.

> module A() option object;
>     export foo::static := proc() print( 'procname'( _passed ) ) end proc;
>     export ModulePrint := proc() 'A' end;
> end module:
>
> module B() option object;
>     export foo::static := proc() print( 'procname'( _passed ) ) end proc;
>     export ModulePrint := proc() 'B' end;
> end module:
>
> foo := proc() print( 'procname'( _passed ) ) end proc:
>
> foo( A );
                                        A:-foo(A)

> foo( B );
                                        B:-foo(B)

> foo( A, B );
                                      A:-foo(A, B)

> foo( B, A );
                                      B:-foo(B, A)

 

Notice how the following two calling sequences differ:


> foo( A, B );
                                      A:-foo(A, B)

> A:-foo(B);

                                        A:-foo(B)

 

In the first case, the object A is passed as a parameter into A:-foo; in the second case, it is not.  With objects, methods are generally declared static.  Thus, if a method is going to modify the object (A in this case), the object needs to be passed in as a parameter.

So foo( A, B ) and A:-foo( B ) are very different.

I'm not sure what you mean by "nested objects".  When scanning a parameter sequence, Maple looks for objects; anything else (like a list containing objects) is not an object.
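
For example, using the definitions above, wrapping A in a list suppresses the dispatch; the list is not itself an object, so the plain global foo runs (a small sketch, output indentation approximate):

> foo( [ A ] );
                                       foo([A])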

With respect to :- (I think you accidentally wrote :=): :- is used to specify which module or package a name is found in.  When used as a unary prefix operator, it specifies the global, top-level name.  This is generally used when there is a module or package procedure that would otherwise be used if the :- were not present.  For an object method, the global name is the name that is mapped to the object method, so :-foo( A ) and foo( A ) are the same, assuming foo is not overloaded.  When foo is overloaded, :- can be used to access the global foo, which is the foo that maps onto the object method.

> module A() option object;
>     export foo := proc() print( 'procname'( _passed ) ) end proc;
>     export ModulePrint := proc() 'A' end;
> end module:
>
> module fooExporter()
>     option package;
>
>     export foo := proc() print( 'procname'( _passed ) ) end proc;
> end:
>
> foo( A );
                                        A:-foo(A)

> :-foo( A );
                                        A:-foo(A)

>
> fooExporter:-foo( A );
                                   fooExporter:-foo(A)

> with( fooExporter );
                                          [foo]

>
> foo( A );
                                   fooExporter:-foo(A)

> :-foo( A );

                                        A:-foo(A)

Darin

-- Kernel Developer Maplesoft

@Jarekkk When timing parallel code, you need to use time[real]() instead of time().  time() measures CPU time, which sums the time spent on each CPU when multiple CPUs are running.

Also, when running this example, even 10^5 is relatively short.  However, using time[real] and 10^7, I see a decent speedup.  Also, notice that this example only creates two tasks, so the speedup won't be larger than 2 times.  To utilize more than 2 CPUs you will need to subdivide the for loops themselves.

> restart;
> p1 := proc(N)
>       local i, result;
>  
>       for i from 1 to N
>       do
>           result[i] := 2*i-1
>       end;
>  
>       op(result);
>  end:
>  
>  p2 := proc(N)
>       local i, result;
>  
>       for i from 1 to N
>       do
>           result[i] := 2*i;
>       end;
>  
>       op(result);
>  end:
>  
>  st:=time[real]():
>  Threads:-Task:-Start( passed, Task=[p1,10^7], Task=[p2,10^7] ):
memory used=11.4MB, alloc=37.1MB, time=0.18
memory used=25.3MB, alloc=53.1MB, time=0.41
memory used=52.7MB, alloc=85.1MB, time=0.91
memory used=88.7MB, alloc=125.1MB, time=1.40
memory used=189.3MB, alloc=253.9MB, time=3.31
memory used=331.4MB, alloc=381.9MB, time=5.09
memory used=470.8MB, alloc=381.9MB, time=6.81
memory used=610.5MB, alloc=512.0MB, time=8.61
memory used=757.0MB, alloc=768.0MB, time=12.52
memory used=1172.7MB, alloc=900.0MB, time=16.42
>  time[real]()-st;
                                         10.863

>
> restart;
> st := time[real]():
> for i to 10^7 do results[i] := 2*i-1 end do:
memory used=1307.9MB, alloc=18.0MB, time=19.21
memory used=1321.7MB, alloc=42.0MB, time=19.36
memory used=1344.0MB, alloc=66.0MB, time=19.65
memory used=1388.1MB, alloc=122.0MB, time=20.25
memory used=1447.3MB, alloc=253.7MB, time=20.81
memory used=1536.9MB, alloc=253.7MB, time=22.10
memory used=1627.1MB, alloc=382.7MB, time=23.39
memory used=1685.0MB, alloc=383.0MB, time=24.65
memory used=1807.8MB, alloc=511.0MB, time=25.96
> for i from 10^7+1 to 2*10^7 do results[i] := 2*i end do:
memory used=1983.9MB, alloc=513.0MB, time=28.66
memory used=2092.6MB, alloc=513.0MB, time=31.21
memory used=2216.3MB, alloc=641.0MB, time=32.89
memory used=2420.8MB, alloc=773.0MB, time=34.17
memory used=2573.0MB, alloc=773.0MB, time=36.93
> op(result):
> time[real]()-st;
                                         18.309

Darin

@samiyare 

In that case you'll have to do something like this:

> p1 := proc(i, r)
>     r + i;
> end:
>  
> p2 := proc(i, r)
>     r + i;
> end:
>
> result1[1] := 1:
> result2[1] := 2:
> for i from 2 to 10
> do
>     result1[i], result2[i] :=
>         Threads:-Task:-Start( passed, Task=[p1,i,result1[i-1]],
>                                         Task=[p2,i,result2[i-1]] ):
> end:
>
> print(result1);
table([1 = 1, 2 = 3, 3 = 6, 4 = 10, 5 = 15, 6 = 21, 7 = 28, 9 = 45, 8 = 36,

    10 = 55

    ])

> print(result2);
table([1 = 2, 2 = 4, 3 = 7, 4 = 11, 5 = 16, 6 = 22, 7 = 29, 9 = 46, 8 = 37,

    10 = 56

    ])

However, this is only going to be useful if the calculations in p1 and p2 are relatively expensive.  Starting and stopping tasks for each iteration of the loop adds some overhead.

@marvin There should be a help page for Threads:-Task:-Return.

?Threads:-Task:-Return

brings up the help page in Maple 14 (for me).

Darin

p.s.  I seem to have stopped getting notifications about comments on this post, so sorry for the late reply.

@erik10 In Maple, evalf is much more than a rounding function.  evalf[n](x) evaluates x using n digits of precision.  This is particularly useful when you increase n above the default of 10: you can numerically evaluate an expression using 20, 50, or 1000 digits of precision.  In some sense, there is no "full precision" when using evalf, especially if your calculations include constants like Pi.  Try calling evalf[n](Pi) for various values of n; I think you'll begin to understand why evalf works the way it does.

You can evaluate your expression using evalf to compute the result, then call evalf[2] on the result to reduce it to 2 digits of precision.
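
For example (a quick sketch; output shown as Maple would display it):

> evalf[5]( Pi );
                                        3.1416

> r := evalf( exp(1) );
                                  r := 2.718281828

> evalf[2]( r );
                                          2.7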

Darin

@fwchapman That setup sounds fine.  The most important aspect is probably the processor and RAM.  The Core i7 performs better when executing parallel code than older processors.   Of course, more RAM is almost always better.  There are no limitations that are particular to your setup, but parallel programming in Maple does have some limitations, which I outlined in a previous blog post.

http://www.mapleprimes.com/maplesoftblog/36467-The-Current-Limitations-Of-Parallel

Can you tell I write more C code than Maple?  Thanks for pointing out the mistakes in my code.  I've updated the example.

Darin

-- Kernel Developer Maplesoft

The idea of racing algorithms is often an attractive one, but it has limitations.  In particular, there is usually a fixed number of algorithms that can be tried, which inherently limits the scalability of this type of solution.  Eventually you'll have more cores than algorithms, and then you want the algorithms themselves to be parallel.  As well, you often have some idea of which algorithms are more likely to find a solution, so it would be better to make the algorithms themselves parallel and run them in the most promising order, rather than running all the algorithms in parallel.

Darin

-- Kernel Developer Maplesoft

 

The reason we do this is that OS X sets a really low datalimit by default (that's where the 6144 number comes from).  This limit is too low for Maple to run successfully, so all this machinery works around that data limit.  It looks like the script is not smart enough to handle cases where users have set their own data limit.

Darin

-- Kernel Developer Maplesoft

http://www.elsevierdirect.com/companion.jsp?ISBN=9780123705914

It is listed on the page opposite the Preface.

Darin

-- Kernel Developer Maplesoft

Yep, that is basically the problem, and the solution that I came up with.  I think this example shows how easy it is to overlook the fact that variables are being shared between threads.  The solution, wrapping the Add'ed expression in a procedure, is also a nice way of fixing the problem.
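
A minimal sketch of that fix (hypothetical example; because x is a local of f, each invocation, and hence each thread, gets its own copy instead of sharing one variable):

> f := proc( i )
>     local x;       # x is local to each call, so it is not shared between threads
>     x := i^2;
>     x + 1;
> end proc:
>
> Threads:-Add( f( i ), i = 1 .. 100 );
                                         338450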

Having the kernel automatically determine which variables are shared and create thread-local copies of those variables is possible.  It is a step beyond simply determining whether a given expression is thread safe, to actually modifying the expression to make it thread safe.  Parallelizing compilers already do things like this.

Allowing the user to explicitly list the shared variables so the kernel can replace them with thread-local instances is quite possible.  However, I'd like a consistent way of specifying additional arguments (there are other parameters that the parallelized versions could accept) that would work for all of Add, Map, Mul, and Seq while keeping them compatible with add, map, mul, and seq.  Compatibility with add, map, mul, and seq is important because eventually those functions will use the parallel implementations, and I'd like users to be able to pass these types of arguments to them as well.  map is the real problem because it passes any additional arguments to the mapped function.

Darin

-- Kernel Developer Maplesoft
