In my previous posts I have discussed various difficulties encountered when writing parallel algorithms. At the end of the last post I concluded that the best way to solve some of these problems is to introduce a higher level programming model. This blog post will discuss the Task Programming Model, the high-level parallel programming model introduced in Maple 13.

Definitions

  • Task: a unit of executable code.

Introduction

Consider the following piece of Maple code

f := proc()
    fc( f1(args1), f2(args2), ... fn(argsn) );
end

To evaluate f, we first evaluate the fi's and compute their return values. These return values are then passed into fc as arguments, and when fc completes, its return value becomes the return value of f. The Task Programming Model takes this pattern, but creates tasks for the fi's and fc. So what is a task? A task is a piece of executable code. In Maple, a task is a procedure combined with a set of arguments to that procedure. Once a task is created, the Maple kernel can execute it. By allowing the kernel to schedule tasks, Maple can automatically distribute them to the available cores.

In the example above, a function call fi can be replaced by a task ti in a fairly straightforward way: the procedure is fi and the arguments are argsi. However, the task tc corresponding to the function call fc is a little more complex. The function call fc will not be executed until all the fi's have completed, thus supplying their return values to fc as arguments. Similarly, the task tc must wait for values from the ti's before it can execute. tc's procedure is fc, and its arguments are the values passed from the other tasks. These other tasks are called tc's child tasks, and tc is called the parent of the ti tasks. The task tc is called the continuation task.

In the example code above, the value returned by f is the return value of fc. Similarly, when a task t creates a continuation task tc, the value returned by tc is given to t's parent, and the value returned by t itself is discarded. This may seem strange at first; however, it is due to an important property of the Task Programming Model: tasks always execute to completion. A task t does not need to stop in the middle of its execution to wait for child tasks to complete; instead it can finish, and the continuation task will handle the child tasks' return values. This is why tc is called the continuation task: it replaces t as a child of t's parent and continues the computation started in t. Of course, if t does not create any tasks, then its own return value is given to its parent.
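As a minimal sketch of this behavior (the procedure names here are hypothetical, but the calls follow the Threads:-Task API used throughout this post), note how the string returned by the root task is thrown away, and Start instead returns the value computed by the continuation task:

(**) child := proc( x )
(**)     x^2;
(**) end proc:
(**) 
(**) cont := proc( a, b )
(**)     a + b;
(**) end proc:
(**) 
(**) root := proc()
(**)     Threads:-Task:-Continue( cont, Task=[child, 2], Task=[child, 3] );
(**)     "discarded";  # this return value is thrown away
(**) end proc:
(**) 
(**) Threads:-Task:-Start( root );
                                           13

Here Start returns 4 + 9 = 13, the value computed by cont, not the string returned by root.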

Here is a Task-based implementation of the simple example we have been working with

(**) doSum := proc( A, first, last )
(**)     local i, l;
(**) 
(**)     l := 0;
(**)     for i from first to last
(**)     do
(**)         l := l+A[i];
(**)     end;
(**) 
(**)     l;
(**) end:
(**) 
(**) taskCont := proc( l, r )
(**)     l+r;
(**) end proc:
(**) 
(**) f := proc( A )
(**)     local n, n2;
(**)     n := rtable_num_elems( A );
(**)     n2 := floor( n/2 );
(**) 
(**)     Threads:-Task:-Continue( taskCont, Task=[doSum, A, 1, n2 ], 
(**)                                         Task=[doSum, A, n2+1, n ] );
(**) 
(**)     1;
(**) end:
(**) 
(**) A := LinearAlgebra:-RandomVector( 2000 ):
(**) Threads:-Task:-Start( f, A );
                                          1932

(**) print( add( A[i], i=1..2000 ) );
                                          1932 

There are a few interesting things to notice about this implementation. The return value of Threads:-Task:-Start is the final value, whereas our previous examples used a global variable. Further, the value returned by Start is the value returned by the continuation task, taskCont, and not the value returned by the original task, f. As well, because this example does not use a global variable, it does not need a mutex. In the Task Model the continuation task, taskCont, will not be executed until its two child tasks are complete and have given taskCont their return values. This scheduling allows the problem to be solved without needing explicit synchronization tools.

Now, there are a few flaws in this implementation; for example, we only create two tasks, each of which performs a significant amount of work. This limits our parallelism. I promised that using the Task Programming Model would help with scaling, so we need to modify this example slightly to achieve that result. We'll use the ability of a task to create more tasks to improve this.


(**) taskCont := proc( l, r )
(**)     l + r;
(**) end:
(**) 
(**) task := proc( A, low, high )
(**)     local i, n, mid;
(**)     n := high-low;
(**) 
(**)     if ( n > 10000 ) then
(**)         mid := floor(n/2) + low;
(**) 
(**)         Threads:-Task:-Continue( taskCont, Task=[task, A, low, mid ], 
(**)                                         Task=[task, A, mid+1, high ] );
(**)     else
(**)         n := 0;
(**)         for i from low to high
(**)         do
(**)             n := n + A[i];
(**)         end do;
(**) 
(**)         n;
(**)     end;
(**) end:
(**) 
(**) n := 10^6:
(**) A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=integer[8]] ):
(**) 
(**) Threads:-Task:-Start( task, A, 1, n );
                                          4001

(**) add( A[i], i=1..n );
                                          4001

This example divides the input range until a base case is reached, creating a new Task for each sub-division. This creates a large number of relatively small tasks, which the kernel is able to spread over the available cores. As long as there are more tasks ready to execute than there are cores, the kernel will be able to keep all the cores busy. This allows algorithms written using the Task Programming Model to scale without requiring the programmer to worry about the number of available cores. The ability to scale can be illustrated by using the kernelopts( numcpus ) feature (note: if you have executed some code in the Task Programming Model, you must restart the kernel before setting kernelopts( numcpus )).

(**) n := 10^8:
(**) A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=integer[8]] ):
(**) kernelopts( numcpus );
                                            1

(**) time[real]( Threads:-Task:-Start( task, A, 1, n ) );
                                         56.714

... some code has been removed for brevity's sake, see the attached worksheet for the complete code ...

(**) kernelopts( numcpus );
                                            2

(**) time[real]( Threads:-Task:-Start( task, A, 1, n ) );
                                         34.182


(**) kernelopts( numcpus );
                                            3

(**) time[real]( Threads:-Task:-Start( task, A, 1, n ) );
                                         21.026


(**) kernelopts( numcpus );
                                            4

(**) time[real]( Threads:-Task:-Start( task, A, 1, n ) );
                                         19.685

As you can see, simply increasing the number of cores available to Maple makes the code faster, so even this relatively simple example is able to scale. Unfortunately, the scaling from 3 to 4 cores (and beyond) is not very good. This is a limitation of the kernel; there is still work to be done to increase its parallelism, and future releases of Maple should continue to improve. Luckily, code written in the Task Model today should automatically take advantage of any increased parallelism in the future.

It is always good to compare the multi-threaded version of the code against the fastest single-threaded version. In Maple, the best way to sum over an array is to use add.

(**) n := 10^8:
(**) A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=integer[8]] ):
(**) time[real](add( A[i], i=1..n )); 
                                         27.646

Now, this is much faster than our Task code when running on one core, and it even outperforms our code on two cores; however, at three cores our simple library code is faster than this highly optimized kernel routine. In fact, we could further optimize our library code by using add instead of a for loop to compute the base case.
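A sketch of that optimization (an untested variant of the task procedure from the example above; only the base case changes):

(**) task := proc( A, low, high )
(**)     local i, mid;
(**) 
(**)     if ( high - low > 10000 ) then
(**)         mid := floor( (high-low)/2 ) + low;
(**) 
(**)         Threads:-Task:-Continue( taskCont, Task=[task, A, low, mid ], 
(**)                                         Task=[task, A, mid+1, high ] );
(**)     else
(**)         # let the optimized add routine, rather than an
(**)         # interpreted for loop, sum the base case
(**)         add( A[i], i=low..high );
(**)     end if;
(**) end:

The recursive splitting is unchanged; only the leaf tasks now call add, so each base case runs at the speed of the kernel routine while the Task Model still spreads the leaves across cores.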

For more information on the Task Programming Model, see the Threads:-Task help pages.

In my next post I will provide some more examples of programming using the Task Model.

Download a worksheet containing the examples from this post
