dohashi

I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel; recently, however, I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple, and have been trying to teach parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on Algorithm and Data Structure Design and Analysis.

MaplePrimes Activity


These are replies submitted by dohashi

If someone does not find a correct solution for my example, I'll post one on Friday.

Darin

-- Kernel Developer Maplesoft

The first argument to add or Add (and mul, seq, etc.) is evaluated once for each value specified by the second argument.  The parallel versions of these functions divide the values between multiple threads, and the threads perform the evaluations in parallel.  In this case, the expression given to Add needs a value for k to perform the evaluation.  So there is definitely a speed-up to be found using Threads:-Add in this case.
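As a minimal sketch of the point above, the same summand can be handed to add and to Threads:-Add. The summand k^2 and the bound n are illustrative choices only, not taken from the original question; an exact integer sum is used so that both results can be compared directly.

```maple
# Illustrative values only: n and the summand k^2 are assumptions.
n := 10^5:

# Sequential: k^2 is evaluated once for each k = 1..n on a single thread.
s1 := add(k^2, k = 1 .. n):

# Parallel: the values of k are divided between threads, and each thread
# evaluates k^2 for its own share before the partial sums are combined.
s2 := Threads:-Add(k^2, k = 1 .. n):

# Both versions compute the same exact integer.
evalb(s1 = s2);
```

Whether the parallel call is actually faster depends on how expensive each evaluation of the summand is; a summand as cheap as k^2 is only meant to show the calling pattern.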

Darin

-- Kernel Developer Maplesoft

I'm glad that you were able to make the code faster.  It seems like the recursive task creation has a larger overhead than I was expecting.  That is definitely something for me to investigate.

As for the crash, unfortunately that is undoubtedly a bug.  I will investigate that one too.

I notice that you are using

Xpre:=Vector(Xcdim,datatype=float[8]);
seq(assign('Xpre'[k], (X[i,k] + X[j,k])/2.0), k=1..Xcdim ) ;

This is probably faster:

Xpre:=Vector(Xcdim,datatype=float[8], k->(X[i,k] + X[j,k])/2.0 ) ;

Darin

-- Kernel Developer Maplesoft

I think the big issue with this example is that the base case of 1000 is too small given how fast it is to compute the add.  Using a larger base case improves the performance.  That is effectively what you are doing by specifying the number of tasks; however, adjusting the base case size allows the code to remain independent of the number of processors.
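A rough sketch of what an adjustable base case could look like in the Task Programming Model follows. The names taskAdd and addResults, the extra base parameter, and the sizes used in the call are all hypothetical and are not the code from the original question.

```maple
# Hypothetical continuation: combines the results of the two child tasks.
addResults := proc(a, b)
    return a + b;
end proc:

# Hypothetical task: sums the integers i..j, splitting the range in half
# until it is smaller than base, the adjustable base case size.
taskAdd := proc(i, j, base)
    local k, mid;
    if j - i < base then
        # Base case: small enough to sum directly in this task.
        return add(k, k = i .. j);
    else
        # Recursive case: create two child tasks, one for each half.
        mid := i + floor((j - i)/2);
        Threads:-Task:-Continue(addResults,
            Task = [taskAdd, i, mid, base],
            Task = [taskAdd, mid + 1, j, base]);
    end if;
end proc:

# A base case of 10^5 instead of 1000 gives each task more work to do.
Threads:-Task:-Start(taskAdd, 1, 10^7, 10^5);
```

Because only the base case size is tuned, the same call works unchanged on machines with different numbers of cores.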

Darin

-- Kernel Developer Maplesoft

I suspect that the new version could still be faster once compiled, although I have not tested that.

No, I don't think that breaking your code into smaller chunks would help speed up the example.  Reducing the total amount of memory used could help.  The current garbage collector can misbehave in parallel, which leads to Maple allocating more memory than is probably necessary.  This can slow Maple down.  Notice that the single-threaded code takes about 100 Megs and that amount remains quite stable, whereas the parallel version starts at 100 Megs and grows to about 800 Megs by the time it is done.

Now, the Task Programming Model works best when there are a large number of small(ish) tasks, but for your example, running on machines with 2 or 4 cores, it is probably not a big difference.

Darin

-- Kernel Developer Maplesoft

I took a look at this code the last time you submitted it.  The big problem is memory usage and garbage collection.  If you can reduce the memory used by the code, then it will probably parallelize better.

That said, there is still some room for general improvements.  I cleaned up your MakeV0A function and was able to speed it up a bit, both single and multi-threaded.

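MakeV0A:=proc(i,j,Nij,X,sigma,Xcdim,Xrfull)
    local V0,den,k,l,pot,psi,tempdiffs,Xpre;

    # Common denominator for the scaled squared distances.
    den:=(2*sigma^2);

    # Midpoint of rows i and j of X.
    Xpre := [ seq( (X[i,k] + X[j,k])/2.0, k=1..Xcdim ) ];
    # Squared distance from the midpoint to each row of X, divided by den.
    tempdiffs := [ seq( add( (Xpre[k]-X[l,k])^2, k=1..Xcdim )/den, l=1..Xrfull) ];
    # Weighted and unweighted sums of exp(-d) over the scaled distances d.
    pot:=add( k*(exp(-k)), k in tempdiffs );
    psi:=add( exp(-k), k in tempdiffs );

    # Guard against 0/0 when both sums are zero.
    if  pot=0.0 and psi=0.0 then
        V0:=0.0;
    else
        V0 := Nij*pot/psi;
    end if;

    return V0;
end proc: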

Also, I think there might be a bug in your original code.  When you call MakeV0 in the single-threaded case, you pass 1 for istart and jstart, which means that when you call MakeV0A for (i,j) you pass (i+1, j+1, N[i+1,j+1] ... ).  Is that what you wanted?

Darin

-- Kernel Developer Maplesoft

Please see my latest blog post for comments and answers to some of your questions.

Darin

I think if you take a look at the Mandelbrot example code, it is very similar to how you would implement your #2.  The Mandelbrot code accepts a Maple Matrix and fills it in.  If your matrix is triangular, I would suggest simply modifying the if statement that checks that the indices are in bounds.

There will be some differences for Windows; however, it should not be too hard for you to figure those out.

As for double precision, only the most recent CUDA hardware (compute level 1.3 and higher) supports double precision, and it is slower than single precision (although still faster than doing it on the host).

Darin

-- Kernel Developer Maplesoft

 

One thing I may need to point out is the difference between locking and blocking.  You can lock a structure without causing blocking.  Locking only causes blocking when two or more threads attempt to acquire the same lock.  If these submatrices are not shared between multiple threads, then your code will still lock when you access the table, but this won't cause blocking.  Now, there is some performance hit when locking in this case, because it is not strictly necessary, but currently the kernel needs to do this because it does not know if the rtable is shared with another thread.
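The locking the kernel does on rtables is internal, so the following is only an analogy using the user-level Threads:-Mutex package. The procedure bump, the counter Array A, and the iteration counts are hypothetical. Every call to Lock takes the lock, but a thread only blocks when the other thread happens to hold the same mutex at that moment.

```maple
# Hypothetical worker: increments a shared counter n times under mutex m.
bump := proc(m, A, n)
    local i;
    for i to n do
        Threads:-Mutex:-Lock(m);     # locks every time; blocks only if
                                     # another thread currently holds m
        A[1] := A[1] + 1;
        Threads:-Mutex:-Unlock(m);
    end do;
end proc:

m := Threads:-Mutex:-Create():
A := Array(1 .. 1, datatype = integer[4]):

# Two threads share the same mutex, so some Lock calls may block.
id1 := Threads:-Create(bump(m, A, 1000)):
id2 := Threads:-Create(bump(m, A, 1000)):
Threads:-Wait(id1, id2):

A[1];
Threads:-Mutex:-Destroy(m):
```

If only one thread ran bump, each Lock would still be acquired and released, at a small cost, but no call would ever block. That is the situation described above for submatrices that are not shared.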

Darin

-- Kernel Developer Maplesoft

 

I am planning on doing a blog post on GPU computations in general.  I will definitely post a complete example then.

Do you think you are having trouble with the CUDA side or the external call side?

 

Darin

-- Kernel Developer Maplesoft

Unfortunately I can't really guess when any particular feature will be done.  One of the big problems is that the Math Library programmers have way more code to deal with than we have in the kernel.  Even putting together the plan for how to start parallelizing the library is going to take some time.  I'll talk more about this when I do my "limitations" post.

As for Grid, it depends on how Grid works, and unfortunately I'm not that familiar with it.  If there is only one kernel running on the node computers, and these nodes have multiple cores, then parallel programming can be useful.  However, if each node is running one kernel per core, then parallel programming on the nodes is probably not a good idea.

I have spent some time investigating and experimenting with CUDA and OpenCL.  They are very fast for a limited set of problems.  In particular, single-precision, data-oriented parallel programming is where they really excel.  By "data oriented" I mean you want to do the same (or a very similar) thing to a large number of data points.  Numeric linear algebra is a typical example.  The latest generation of cards does support double precision, but it is slower.  Currently Maple does not have any built-in way of accessing GPUs, but you can connect to either of these APIs via external call.  I have written a test app that generates a Mandelbrot set using CUDA via Maple external call.

Darin

-- Kernel Developer Maplesoft

I have already been tasked with writing a parallel programming chapter for the Advanced Programming Guide.  These blog posts are definitely going to be helpful in that regard. 

This is another reason I'd like to encourage feedback.  Anything that I could improve with these posts will help improve the chapter.

As for the corporate site, we discussed that briefly before I started blogging.  I think that the corporate blog people wanted me to make the posts a bit more "corporate", which would cause me to spend more time writing and less time doing my actual job.

Darin

-- Kernel Developer Maplesoft

Thanks, these are good ideas.

I will post a blog about the current limitations of parallelism in Maple. 

Thread Safety is a tricky thing to describe, so how data can be shared in Maple may be worth a post of its own as well.

Darin

-- Kernel Developer Maplesoft

GPU is an interesting topic. Is there a way to use it from Maple?

Well, like almost anything, you can connect Maple to CUDA or OpenCL via external call.  However, there is currently no built-in support for accessing GPU hardware from Maple.  It is something we are investigating.

As far as I understand the current situation, with code in Maple language being 500-1000 slower than, say, in C, it doesn't have much sense to use parallel programming for Maple code other than for some worksheet effects

I would disagree with this assessment.  My main argument is described in the Why Go Parallel blog post.  If Maple does not go parallel, it won't show significant speed-ups on new hardware.  Now, I am not claiming we have achieved this goal, but we have started taking steps toward it.  In addition, I think you may also be making a false assumption: that we could simply make Maple as a whole 500 to 1000 times faster.  Even doubling Maple's performance, in general, would take a significant amount of work.  However, these kinds of speed-ups are available from going parallel.  Getting anywhere close to C-type performance requires compiling Maple code.  Having something like a JIT would be great, but it would also be a huge amount of work.

Darin

-- Kernel Developer Maplesoft

Unfortunately, the debugger does not work very well with threads or Tasks.  Fixing this is relatively high on our priorities list.

Currently the debugger does not support any explicit threading commands (listing threads, changing threads, etc).  However the debugger will work in any thread.  If one thread is stopped in the debugger other threads continue to run, unless they hit breakpoints as well.  When this happens, there can be multiple debugger sessions attached to multiple threads.  Debugging like this can be confusing as it can be hard to tell which DBG> prompt corresponds to which thread.

Similar rules apply to the Task Programming Model, with the added caveat that the call stack does not work properly.  We'd like the call stack for a Task to show its parent tasks; however, that does not currently work.

Darin

-- Kernel Developer Maplesoft
