In my previous posts I discussed the basic differences between parallel and single threaded programming, and showed how controlling access to shared variables can solve some of the problems parallelism introduces. In this post, I am going to discuss two more difficulties in writing good parallel algorithms: load balancing and thread management.

Here are some definitions used in this post:

  • scale: the ability of a program to get faster as more cores become available
  • load balancing: how effectively work is distributed over the available cores
  • coarse grained parallelism: parallelizing routines at a high level
  • fine grained parallelism: parallelizing routines at a low level

Consider the following example:

(**) work := proc(n) # do some O(n) "work"
(**)     local i;
(**)     for i from 1 to n
(**)     do
(**)     end do;
(**)     n;
(**) end:
(**) 
(**) p := proc( A, low, high, m )
(**)     local i, count;
(**)     global shared_variable;
(**) 
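(**)     # sum the "work" over the entries A[low..high]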
(**)     count := 0;
(**) 
(**)     for i from low to high
(**)     do
(**)         count := count + work(A[i]);
(**)     end do;
(**) 
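(**)     # safely add this thread's subtotal to the global total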
(**)     Threads:-Mutex:-Lock( m );
(**)     shared_variable := shared_variable + count;
(**)     Threads:-Mutex:-Unlock( m );
(**) end proc:
(**) 
(**) N := 100000000: # the total amount of work
(**) M := 100:
(**) n := N/M:
(**) A := [ seq( M, i=1..n ) ]: # divide the work evenly over each entry
(**) m := Threads:-Mutex:-Create():
(**) shared_variable := 0:
(**) 
(**) t:=time[real]():
(**) p( A, 1, nops(A), m ):
(**) t := time[real]()-t:
(**) 
(**) print( t ):
                                         12.982

(**) print( shared_variable ):
                                        100000000 

(**) 
(**) shared_variable := 0:
(**) n := nops(A):
(**) n2 := floor( n/2 );
                                      n2 := 500000 

(**) t:=time[real]():
(**) Threads:-Wait( op( map( Threads:-Create, [ 'p( A, 1, n2, m )', 'p( A, n2+1, n, m )' ] ) ) ):
(**) t := time[real]()-t:
(**) 
(**) print( t );
                                          7.123

(**) print( shared_variable ):
                                        100000000

In this example, each entry of the Array determines how much time is spent computing the "result", and here every entry takes the same amount of time to compute. By dividing the Array in half and giving each half to a thread, we get a relatively even division of work. This results in the problem being solved in almost half the time, which is exactly what we'd like to see.

However, what if the work required to compute the entries is not evenly distributed?

(**) 
(**) N2 := N/2:
(**) n := N2/M:
(**) A := [ N2, seq( M, i=1..n ) ]: # first element is N/2, the remaining elements sum to N/2
(**) shared_variable := 0:
(**) n := nops(A):
(**) 
(**) t:=time[real]():
(**) p( A, 1, n, m ):
(**) t := time[real]()-t:
(**) 
(**) print( t );
                                         12.333

(**) print( shared_variable ):
                                        100000000

(**) 
(**) shared_variable := 0:
(**) n2 := floor( n/2 );
                                      n2 := 250000

(**) t:=time[real]():
(**) Threads:-Wait( op( map( Threads:-Create, [ 'p( A, 1, n2, m )', 'p( A, n2+1, n, m )' ] ) ) ):
(**) t := time[real]()-t:
(**) 
(**) print( t );
                                         10.254

(**) print( shared_variable ):
                                        100000000


In this example, the first element of the Array accounts for half the total amount of work. Now when we divide the Array in half, one half has far more work to do than the other. One thread will finish quickly, while the other runs for much longer, so the threaded version takes longer than we would like.

We could look at the values in the Array and attempt to pick a better way to partition it; however, it is entirely possible that doing so would take just as long as executing the function directly. It is also possible that we cannot predict how long the problem will take simply by looking at the inputs. Figuring out how to keep all the processors busy is called load balancing. Good load balancing is an essential part of ensuring a parallel algorithm will scale. Unfortunately, writing a good load balancing algorithm can be tricky, and just about every scalable parallel algorithm needs some sort of load balancing.
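
As a rough illustration, here is a minimal sketch of one simple load balancing strategy: rather than fixing the partition up front, each thread repeatedly claims the next small chunk of the Array until none remain. The procedure worker, the global counter next_index, and the chunk size of 1000 are my own names and choices for this sketch, not something from the examples above.

(**) worker := proc( A, n, chunk, m )
(**)     local i, lo, hi, count;
(**)     global next_index, shared_variable;
(**)
(**)     count := 0;
(**)
(**)     do
(**)         # claim the next chunk of indices under the mutex
(**)         Threads:-Mutex:-Lock( m );
(**)         lo := next_index;
(**)         next_index := next_index + chunk;
(**)         Threads:-Mutex:-Unlock( m );
(**)
(**)         if lo > n then break; end if;
(**)
(**)         hi := min( lo+chunk-1, n );
(**)         for i from lo to hi
(**)         do
(**)             count := count + work(A[i]);
(**)         end do;
(**)     end do;
(**)
(**)     Threads:-Mutex:-Lock( m );
(**)     shared_variable := shared_variable + count;
(**)     Threads:-Mutex:-Unlock( m );
(**) end proc:
(**)
(**) next_index := 1:
(**) shared_variable := 0:
(**) Threads:-Wait( op( map( Threads:-Create, [ 'worker( A, nops(A), 1000, m )', 'worker( A, nops(A), 1000, m )' ] ) ) ):

With a reasonably small chunk size, the thread that happens to claim the one large entry simply stops asking for new chunks while the other thread works through the rest of the Array, so both threads stay busy for roughly the same amount of time.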

Let's imagine we are writing a parallel algorithm in Maple. How many threads should it start? If we start fewer threads than cores, then clearly we aren't using the full power of the computer. But we don't want to start more threads than cores either: besides fighting for execution time on a core, each thread consumes other resources as well. Assuming we want our algorithm to run as quickly as possible, one thread per core is the appropriate number. However, our algorithm does not live in a vacuum. It may call library code. What happens if one of the library functions has been parallelized and wants to start threads too? What happens if our code is part of a library that might be called from another parallel algorithm? Clearly we need some mechanism for managing the number of threads used.
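
To make the "one thread per core" rule concrete, here is a minimal sketch that asks the kernel how many cores are available, via kernelopts(numcpus), and starts exactly that many threads, each given an equal share of the range. This only illustrates matching the thread count to the core count; it does not address the nested parallelism problem just described.

(**) ncores := kernelopts( numcpus ):  # number of cores on this machine
(**) step := ceil( n/ncores ):
(**) shared_variable := 0:
(**) Threads:-Wait( seq( Threads:-Create( 'p'( A, 1+(i-1)*step, min(i*step,n), m ) ), i=1..ncores ) ):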

One might think that a simple rule could work: parallelize the high level algorithms (called coarse grained parallelism) and leave the low level routines linear. However, consider our example of poor load balancing. Even if each entry of the Array were given its own core, the running time could at best be halved, because the one large entry always takes N/2 time. What about the opposite approach: parallelize the low level algorithms (fine grained parallelism) but leave the high level code linear? The problem here is illustrated by the example with evenly distributed work. If each of the individual problems is small and not easily parallelized, we won't see any significant speed up in that case either. The solution is that we must have both fine grained and coarse grained parallelism. However, to achieve this we must have a good thread management strategy, and it must be applied to all the parallel routines in a library.
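
As one possible sketch of such a strategy (my own illustration, using invented names psplit, budget, and bm, and an arbitrary cutoff of 1000 entries), the following routine recursively splits its range, spawning a new thread only while a global thread budget allows and recursing serially otherwise. Because every routine in a library could check the same budget, the rule holds no matter where in the call stack the code decides to go parallel:

(**) budget := kernelopts( numcpus ) - 1:  # how many extra threads we may still spawn
(**) bm := Threads:-Mutex:-Create():
(**)
(**) psplit := proc( A, low, high, m )
(**)     local mid, id, spawn;
(**)     global budget, bm;
(**)
(**)     if high - low < 1000 then  # small range: not worth a thread
(**)         return p( A, low, high, m );
(**)     end if;
(**)
(**)     mid := floor( (low+high)/2 );
(**)
(**)     # reserve a slot in the thread budget, if one is free
(**)     Threads:-Mutex:-Lock( bm );
(**)     spawn := evalb( budget > 0 );
(**)     if spawn then budget := budget - 1; end if;
(**)     Threads:-Mutex:-Unlock( bm );
(**)
(**)     if spawn then
(**)         id := Threads:-Create( 'psplit'( A, low, mid, m ) );
(**)         psplit( A, mid+1, high, m );
(**)         Threads:-Wait( id );
(**)         Threads:-Mutex:-Lock( bm );
(**)         budget := budget + 1;  # give the slot back
(**)         Threads:-Mutex:-Unlock( bm );
(**)     else
(**)         psplit( A, low, mid, m );
(**)         psplit( A, mid+1, high, m );
(**)     end if;
(**) end proc:

A fine grained routine called from an already parallel coarse grained one will find the budget exhausted and simply run serially, which is exactly the behaviour we want.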

Both the load balancing and thread management problems must be solved to write good, scalable parallel algorithms. For a system like Maple, with its large library, it is not reasonable to expect the library to be parallelized if each routine must solve these problems for itself. Therefore, it makes sense to consider a higher level parallel programming model. In my next post, I will describe the Task Programming Model, the high level parallel programming model we added in Maple 13.

Download a worksheet containing these examples.
