dohashi

1202 Reputation

10 Badges

20 years, 88 days
I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel, however recently I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple, and have been trying to teaching parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on Algorithm and Data Structure, Design and Analysis.

MaplePrimes Activity


These are replies submitted by dohashi

@Alejandro Jakubi 

No, the point is whether these distributed grid mode computations can be done within a single machine by means of the Grid package alone, without the Grid Computing Toolbox.

Just to be clear, all the functions in the Grid package work without the Grid Computing Toolbox.  The only thing I can think of that is different is that Grid:-Setup only supports the local option, however that is clearly documented.

Without the Grid Computing Toolbox you can share computations across a grid of local nodes (aka nodes on the current machine, created and managed by Maple).

With the Grid Computing Toolbox you can share computations across a grid of distributed nodes (aka nodes on other computers created and managed by a grid control system).

There are certain functions that are more useful in one context or the other, but all functions work in both cases.  There is lots of room to improve the documentation, however I don't think we document anything that is unsupported in local Grid without making it clear.

Darin

 

 

@Alejandro Jakubi 

"The Grid package is the api for both local and distributed Grid", meaning that the distributed Grid mode may be used without the Grid Toolbox

That is not what I meant.  Both local and distributed grid use the functions in the Grid package to create Grid based computations.  With the Grid Computing Toolbox you can access externally managed, distributed nodes.  Without the Grid Computing Toolbox, you can only access local nodes, managed by Maple.

Darin

@Alejandro Jakubi For the local grid, the controller node (the main kernel) does very little work, so Maple starts N nodes, thus if you have 8 cpus, Grid will create 8 new grid nodes.  The controller node does not take part in the grid computation, so it does not interfere with the grid nodes that are actually executing.

The distributed grid is only available with the Grid Computing Toolbox.  Only local grid is available with Maple by default.  The available Grid commands are the same regardless if you have Grid Computing Toolbox or not, the difference is whether Maple manages nodes locally itself or attaches to an externally managed distributed grid.

hpc and mpi are common communication protocols for controlling a distributed grid of computers.  With the Grid Computing Toolbox, Maple can connect to and launch commands on grids managed using these protocols.  To do this Maple needs to know what protocol the grid is using, thus in Grid:-Setup you can set what type of grid you have.

Sounds like we do need to improve the documentation for users unfamaliar with grid computation.

Darin

@Alejandro Jakubi Maple includes what we refer to as Local Grid.  Local Grid can start and manage nodes on your local machine.  As an add on, you can purchase the Grid Computing Toolbox that allows Maple to run on distributed grids, that is, multiple networked machines.

With the Local Grid, the kernel that you get when you start Maple acts as a controller for the Local Grid, so it can start nodes itself.  That kernel is outside the grid, and can create the grid (by launching kernels on the local machine) when they are needed.

The Grid package is the api for both local and distributed Grid.  However when programming for the distributed Grid, functions like Launch, Send and Receive become much more important.  Copying a complete state to 4 or 8 local nodes may make sense, but it does not really work if you have 400 nodes over 100 distributed machines.  In that case the programmer needs to be careful that data is only shared when needed.

Darin

@Carl Love

The paragraph is a little confusing.  The "called outside the grid, on a non-compute node, not as part of a parallel computation" is basically defining two cases, one where the kernel calling Map is outside of the Grid, or the kernel calling Map is part of the Grid.  If the kernel is outside of the grid, then Map calls Launch to create nodes to compute the Map command and the result is returned to the calling kernel.  This is the case that most users want to use.  This is also the case shown in the first few examples.  The second case is intended to be used with the Grid Toolbox (although it does work without it) where there are kernels running on multiple machines managed by a system external to Maple.  What this case does is really just for sophisticated users who are writing complex Grid code.  The last example shows this case, although it may not do a very good job of illustrating the difference.

Darin

@itsme The anonymous procedure thing definitely looks like a bug.  As for the other issue, that could be a problem with value sharing.  Since the Grid nodes are separate kernels any values needed for the computation on a particular node needs to be shared from the main kernel.  Map is a simplified interface that tries to automatically detect and share required values, but it may not be robust enough for your computation.

Take a look at the 4th paragraph on the Grid:-Map help page.

If it is important enough I'd suggest taking a look at the full Grid api. However then you are responsible for making sure that all the required data is shared.

Darin

@itsme Can you try executing the following bit of code and report back what is printed?  The f here just loops for 5 seconds then quits.  You should see a time reported of around 5 seconds while i <= NumNodes and then 10 seconds for NumNode < i <= 2*NumNodes.

f := proc( x )
local t;
t := time[real]();

# busy wait to burn cpu
while ( time[real]()-t < 5 ) do end;

x;
end;

n := Grid:-NumNodes();

for i from 1 to 2*n
do
t := time[real]():
L := [ $1..i ]:
Grid:-Map( f, L );
print( i, time[real]() - t );
end:

Another question, are the elements of the list very large?  If you are willing to post your actual example, that would be the quickest way for me to help.

Darin

I have a few questions before I know what to suggest:

1.  How many elements in your list?

2.  Does the function you are mapping take approximately the same amount of time per element in the list?

Darin

-- Kernel Developer Maplesoft

@Markiyan Hirnyk From a quick reading of that thread, it appears as though he needs to use a procedure which has not been verified as thread-safe, and so it may not work properly when run in parallel.  The decision to use Grid is probably the best solution for now.

@Markiyan Hirnyk There is not really anything new.  This page is really just to have a single location that lists the articles.  In particular I often tell people to go look at these blog posts I made, but I didn't really have a nice link to send people to.

That said, I am thinking about working on some more content for the near future.

I've updated my code for solving this example, you can see the new code here.  This change fixes a few memory issues that means the new code can actually run large examples.  Here is a (trivial) example that I ran to make sure the code could sucessfully make it through all the combinations.

> CodeTools:-Usage( FilterCombPara( 113,7, x->false ) );
memory used=17.39TiB, alloc change=460.02MiB, cpu time=88.52h, real time=20.15h
                                                  []

So it filtered all of C(113,7) in 20 hours of real time (88 CPU hours on a 8 core machine).  It used 17 terabytes of memory, but only had to allocate 500M.  With a "real" filter function, this will take longer (potentially much longer depending on the filter's complextity), but we are able filter all combinations in memory.  If the number of matches is small then it should be possible to hold all them all in memory too.  If the number of matches is too large to fit in memory, then it will probably be necessary to write them to disk.  (Which is not something this code does).

Anyway try out the new code, I'd love to know if it works for you, and if it doesn't, I'd like to know what's not working.  These huge examples are a great way to stress test Maple.

Darin

 

@Carl Love Making this code to write the matching combinations to disk is a relatively simple modifitcation.  In filterComb instead of collecting and returning the matching combinations we can write them to disk or add them to a buffer that is periodically written to disk.  Whether or not the writing to disk is going to cause problems with parallelism will depend on what fraction of combinations are matches.  If the number of matches is small (relative to the total number of combinations) it shouldn't be too big of a deal.  Otherwise writing is going to be the bottleneck and parallelizing the search may not actually help much.  Of course, perhaps the OP could invest in a SSD that would significantly speed up the writing or act as a fast virutal disk.

@Carl Love The Threads:-Seq, etc functions use a heuristic where they attempts to create n^(1/3) tasks.  However this is not necessarily the best thing for your code to do.  The Threads functions need to work decently in all possible situations, so it needs a heuristic that handles odd cases that you won't need to worry about.  For your own code, you have a better understanding of how the timing will work out, so you can probably do better by picking the "right" base case size for code.

Darin

For optimal performace, you want to use as small a base case as possible.  However you need to be careful as the overhead of creating lots of new tasks will become a problem.  Ideally this overhead would not be significant issue, but it still is.

The reason you want as many tasks as possible is that it provides both load balancing and scaling.  From a load balancing perspective, if you have a small number of tasks, one task could take much longer than the others to complete.  When this happens, one core will be executing the long task while the other cores are idle, having finished their tasks.  With lots of tasks, the other cores will be able to do more work while the one core is dealing with the long task.  The fastest your Task Model execution can ever be is the length of the longest task, so keeping it small is a good thing.  From a scaling perspective, lots of tasks can be spread across lots of processors, without needing to know how many processors there are.  Yes, you can use numcpus, but its not necessary.

Darin

-- Kernel Developer Maplesoft

@TomJohnRiddle Hmmm... You said that while you are waiting, the CPU utilization is basically zero, so you're not seeing Maple's kernel (mserver process) running at 100%, correct?  If you were seeing the kernel running during that time, I would suspect that there was some seriously inefficent algorithm being used in the kernel.  However if you are not seeing that, and it is not a latency problem of trying to pull each element across the network one at a time, then I'm not sure what else it could be.

What version of MS SQL server are you using and what is the type of data stored in the table?

Thanks

Darin

1 2 3 4 5 6 Page 2 of 6