## Diminishing Returns from Parallel Processing: Is it worth using more than four processors with Threads?

by: Maple 2016

This post is about the relationship between the number of processors used in parallel processing with the Threads package and the resultant real times and cpu times for a computation.

In the worksheet below, I perform the same computation using each possible number of processors on my machine, one through eight. The computation is adding a list of 32 million pre-selected random integers. The real times and CPU times are collected from each run, and these are analyzed with a variety of metrics that I devised. Note that garbage-collection (gc) time is not an issue in the timings; as you can see below, the gc times are zero throughout.

My conclusion is that there are severely diminishing returns as the number of processors increases. There is a major benefit in going from one processor to two; there is a not-as-great-but-still-substantial benefit in going from two processors to four. But the real-time reduction in going from four processors to eight is very small compared to the substantial increase in resource consumption.

Please discuss the relevance of my six metrics, the soundness of my test technique, and how the presentation could be better. If you have a computer capable of running more than eight threads, please modify and run my worksheet on it.

Diminishing Returns from Parallel Processing: Is it worth using more than four processors with Threads?

Author: Carl J Love, 2016-July-30

Run the tests

restart:

kernelopts(numcpus= 1):
currentdir(kernelopts(homedir)):

memory used=0.79MiB, alloc change=0 bytes, cpu time=2.66s, real time=2.66s, gc time=0ns

memory used=0.78MiB, alloc change=0 bytes, cpu time=2.26s, real time=2.26s, gc time=0ns
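The timing code itself did not survive in the worksheet above; only two of the Usage outputs remain. What follows is a hedged reconstruction of a single run based solely on the description in the introduction: the file names, the list variable A, and the use of Threads:-Add are my assumptions, not the author's actual code. The "%m%m%m\n" format is chosen to match the fscanf call in the analysis section below.

```
# Hypothetical sketch of one timing run; "randlist.mpl", "timings.dat",
# and the list name A are assumptions.
restart:
kernelopts(numcpus= 1):            # vary 1..8, with a restart before each run
read "randlist.mpl":               # defines A, the 32M pre-selected integers
(Tr, Tc):= CodeTools:-Usage(Threads:-Add(x, x= A),
                            output= [realtime, cputime]):
fd:= fopen("timings.dat", APPEND):
fprintf(fd, "%m%m%m\n", kernelopts(numcpus), Tr, Tc):
fclose(fd):
```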

Analyze the data

restart:

currentdir(kernelopts(homedir)):

(R,C):= 'Vector(kernelopts(numcpus))' $ 2:
N:= Vector(kernelopts(numcpus), i-> i):

while not feof(fd) do
(n,Tr,Tc):= fscanf(fd, "%m%m%m\n")[];
(R[n],C[n]):= (Tr,Tc)
end do:

fclose(fd):

plot(
(V-> <N | 100*~V>)~([R /~ max(R), C /~ max(C)]),
title= "Raw timing data (normalized)",
legend= ["real", "CPU"],
labels= [`number of processors\n`, `%  of  max`],
labeldirections= [HORIZONTAL,VERTICAL],
view= [DEFAULT, 0..100]
);

The metrics:

- `R[1] /~ R /~ N` — Gain: the speedup from parallelism, expressed as a percentage of the theoretical maximum speedup for that number of processors
- `C /~ R /~ N` — Evenness: how evenly the task is distributed among the processors
- `1 -~ C[1] /~ C` — Overhead: the percentage of resource consumption that is extra, due to parallelism
- `R /~ R[1]` — Reduction: real time as a percentage of the single-processor real time
- `1 -~ R[2..] /~ R[..-2]` — Marginal Reduction: the percentage reduction in real time gained by using one more processor
- `C[2..] /~ C[..-2] -~ 1` — Marginal Consumption: the percentage increase in resource consumption incurred by using one more processor
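As a sanity check (mine, not part of the original post), here is how the first three metrics behave on idealized data with perfect linear scaling and no parallel overhead:

```
# Illustrative only: perfect scaling R[n] = R[1]/n, constant total CPU time.
R:= <12., 6., 4., 3.>:      # real times for n = 1..4
C:= <12., 12., 12., 12.>:   # CPU times
N:= <1., 2., 3., 4.>:
R[1] /~ R /~ N;    # Gain: all entries 1, i.e. 100% of the theoretical maximum
C /~ R /~ N;       # Evenness: all entries 1
1 -~ C[1] /~ C;    # Overhead: all entries 0
```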

plot(
[
(V-> <N | 100*~V>)~([
R[1]/~R/~N,             #gain from parallelism
C/~R/~N,                #how evenly distributed
R/~R[1]                 #reduction
])[],
(V-> <N[2..] -~ .5 | 100*~V>)~([
1 -~ R[2..]/~R[..-2],   #marginal reduction rate
C[2..]/~C[..-2] -~ 1    #marginal consumption rate
])[]
],
legend= typeset~([
'r[1]/r/n',
'c/r/n',
'1 - c[1]/c',
'r/r[1]',
'1 - `Delta__%`(r)',
'`Delta__%`(c) - 1'
]),
linestyle= ["solid"$4, "dash"$2], thickness= 2,
title= "Efficiency metrics\n", titlefont= [HELVETICA,BOLD,16],
labels= [`number of processors\n`, `% change`], labelfont= [TIMES,ITALIC,14],
labeldirections= [HORIZONTAL,VERTICAL],
caption= "\nr = real time,  c = CPU time,  n = # of processors",
size= combinat:-fibonacci~([16,15]),
gridlines
);

## How to limit the number of processors used by Threads?

How can I limit the number of processors that are used by a multithreaded operation? The help pages ?multithreaded and ?kernelopts both suggest, vaguely, that this can be controlled with kernelopts(numcpus). The worksheet below shows, however, that this doesn't work: while Maple does remember what you set numcpus to, the setting has no effect on the number of processors used by Threads. And if it doesn't work, what purpose is there in being able to set the value of numcpus?

restart:

First, warm up and stretch the memory; otherwise, the timings are invalid.

L:= RandomTools:-Generate(list(integer, 2^18)):
CodeTools:-Usage(Threads:-Mul(x, x= L)):

memory used=1.21GiB, alloc change=0.98GiB, cpu time=18.66s, real time=3.75s, gc time=27.54ms

Now this is the actual test.

L:= RandomTools:-Generate(list(integer, 2^18)):
gc();
for n to kernelopts(numcpus) do
    print(kernelopts(numcpus= n));
    CodeTools:-Usage(Threads:-Mul(x, x= L), iterations= 4)
end do:

memory used=1.21GiB, alloc change=322.34MiB, cpu time=4.04s, real time=933.75ms, gc time=566.41ms
memory used=1.21GiB, alloc change=-2.00MiB, cpu time=4.22s, real time=859.50ms, gc time=656.25ms
memory used=1.21GiB, alloc change=-1.17MiB, cpu time=4.03s, real time=841.75ms, gc time=566.41ms
memory used=1.21GiB, alloc change=256.00MiB, cpu time=4.47s, real time=911.25ms, gc time=738.28ms
memory used=1.21GiB, alloc change=0 bytes, cpu time=4.33s, real time=852.50ms, gc time=605.47ms
memory used=1.21GiB, alloc change=0 bytes, cpu time=4.34s, real time=853.50ms, gc time=628.91ms
memory used=1.21GiB, alloc change=0 bytes, cpu time=4.27s, real time=826.75ms, gc time=535.16ms
memory used=1.21GiB, alloc change=0 bytes, cpu time=4.19s, real time=841.50ms, gc time=308.59ms

You'll get equivalent results if you replace Mul by Add and increase the size of L to about 2^24.
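For what it's worth, the one point where the setting does seem to matter is before the thread pool first starts: in the worksheet above, the warm-up call launches the pool at the full width, after which the per-iteration kernelopts(numcpus= n) calls change nothing. A hedged sketch of setting it before any threaded call (whether this reliably caps the thread count is exactly what is in question here):

```
# Sketch only: set numcpus immediately after restart, before the first
# Threads call, since changing it later appears to have no effect.
restart:
kernelopts(numcpus= 2):
L:= RandomTools:-Generate(list(integer, 2^18)):
CodeTools:-Usage(Threads:-Mul(x, x= L)):
```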

I would like to add threads to this code that compiles cleanly:

RECTmatrixSEQ :=  (FD,m, n, xw, yw, tr, k, rho, dmax) ->

[seq
(seq
# A return of 0 means there is no choice - no digit possible
# result of a 4 means there is a choice of (-1 or 0)
(add(RECTpred(m, n, i, j, iter1, k, rho)*(2*abs(iter1)+1), iter1 = -dmax .. dmax)
+ (2*dmax+1)*(RECTpredR(m, n, i, j, dmax, k, rho)+
RECTpredL(m, n, i, j, -dmax, k, rho))+
.5*RECTpredOR(m, n, i, j, dmax, k, rho)+
.5*RECTpredOL(m, n, i, j, -dmax, k, rho),
i = -ceil(2*dmax+2*rho)*2^m .. ceil(2*dmax+2*rho)*2^m-1),
j = 0 .. 2^(n-yw)-1)]:
Here is my modified version, which produces the error below:

RECTmatrixSEQ :=  (FD,m, n, xw, yw, tr, k, rho, dmax) ->

(seq
# A return of 0 means there is no choice - no digit possible
# result of a 4 means there is a choice of (-1 or 0)
(add(RECTpred(m, n, i, j, iter1, k, rho)*(2*abs(iter1)+1), iter1 = -dmax .. dmax)
+ (2*dmax+1)*(RECTpredR(m, n, i, j, dmax, k, rho) +
RECTpredL(m, n, i, j, -dmax, k, rho))+
.5*RECTpredOR(m, n, i, j, dmax, k, rho)+
.5*RECTpredOL(m, n, i, j, -dmax, k, rho),
i = -ceil(2*dmax+2*rho)*2^m .. ceil(2*dmax+2*rho)*2^m-1),
j = 0 .. 2^(n-yw)-1)) ] )
):

I do get a runtime error:

"PROC:feasibilitycheckproONCE

"
Error, (in CodeTools:-Usage) invalid input: too many and/or wrong type of arguments passed to Threads:-Create; first unused argument is j = 0 .. 3

Any help spotting this problem would be appreciated. Thanks in advance.

Bonnie

I am (again) trying to get Maple to do some parallel work using the Threads package.

My actual problem involves a vector function with 6 elements, acting on 6-vectors. Each element of the output vector is calculated as a high-order polynomial of all 6 input elements. The whole thing is a map that I want to iterate. I plan to evaluate each of the 6 functions in a separate thread (the input vector of course is the same for all six) and then put the results together in a Vector, to be used as input for the next iteration.

Facing difficulty I finally wrote myself a little toy program to check out the basic mechanism. Here it is:

restart;
with(Threads):   # Create and Wait below come from the Threads package
f:= x-> 1+x^2;
x:= 0:
tt:= time[real]():

for i from 1 to 10 do
    id:= Create(f(x), y):
    Wait(id):
    x:= y:
    y:= 'y':
end do:

time[real]() - tt;
                             0.018
i;
                               11
So far so good; even the output (not shown) makes sense. BUT: as I increase the number of iterations in the for...do loop, the memory allocation goes up fast, and I hit a point at about 40 iterations where the whole process locks up and the program never ends, cannot even be stopped (at least for 100 iterations), forcing me to abort the whole thing. I have evidence that Maple allocates vast amounts of memory which finally chokes the whole thing (on a 16 GB-RAM machine).

Anyone have any idea what I am doing wrong? I realize the above example does not provide benefits; in the real example there will be 6 Creates and the loop will Wait for all of these to finish.
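For reference, the six-function plan described above can be sketched with Threads:-Map instead of per-iteration Create/Wait bookkeeping. This is not the poster's code: F1..F6, the initial vector V0, and the iteration count are hypothetical stand-ins.

```
# Sketch: evaluate the six component functions in parallel each iteration.
# F1..F6 and V0 are hypothetical; each Fi maps a 6-vector to a scalar.
with(Threads):
Fs:= [F1, F2, F3, F4, F5, F6]:
V:= V0:                               # initial 6-vector
for iter to 100 do
    V:= Vector(Map(f-> f(V), Fs))     # all six components in parallel
end do:
```

Threads:-Map blocks until all elements are computed, so each iteration still sees the completed vector from the previous one.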

I'd really like to get this to work as each function can take quite some time and I do expect at least some speed-up from parallelizing this (even after overhead).

This is on Maple 2015 on Mac OS X 10.10.5 with 16 GB of RAM. I should mention that I set UseHardwareFloats:= true in my .mapleinit file.

Thanks,

M.D.

I'm running calculations like this:

N:=10000;
f := (i,j)-> (some complicated procedure depending on i and j);

I have a server with 20 cores, but each core runs two hardware threads, so this code should be able to max out all 40 threads. What I notice, however, is that at most 20 threads are used at a time.

I checked: kernelopts(numcpus) returns 20.

Does anyone have any advice on how to maximize my resource usage?
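One thing to try (a speculative suggestion, not a verified fix): since kernelopts(numcpus) reports 20 on this machine, raising it explicitly before the first threaded computation may let the thread pool span all 40 hardware threads. Whether the extra 20 hyperthreads actually help depends on the workload.

```
# Speculative sketch: force the pool size before any Threads call is made,
# since the pool width appears to be fixed once threading starts.
restart:
kernelopts(numcpus= 40):
```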

Hi! I am running some grid computations, and I found that I could speed up my computations if my nodes could share some partial results. It seems that Grid:-Send and Grid:-Receive are almost perfect for this. The problem is that Grid:-Receive blocks the computation, and I cannot ensure that a message will ever be sent. I've tried to run Grid:-Receive on a thread in a possibly infinite loop, but it still blocks.

Some pieces of code below:

SignatureReciever:=proc(SigContainer::uneval,stopValue::uneval)
while eval(stopValue) do
end do;
end proc:

SendMessage:=proc(message::string, id::integer)
Threads[Seq](proc(i) if i <> id then Grid:-Send(i,message) end if end proc,i=0..Grid:-NumNodes()-1);
end proc:

In the main procedure run on a node I have

....

....

SendMessage(signature,id);

....

....

stopValue := false;

Do you have any suggestions how to solve my problem?

Hi!

I have a problem recently with running RootFinding:-Isolate inside a procedure which is run on threads. I have many polynomial systems to treat so I would like to balance computations between cores. Unfortunately, my code always breaks computation kernel with the error:

java.lang.NullPointerException

at com.maplesoft.mathdoc.model.WmiModelLock.acquireLock(Unknown Source)
at com.maplesoft.mathdoc.model.WmiModelLock.writeLock(Unknown Source)

I wrote a small example below

f:=proc(n)

local i;

for i from 1 to n do
try

RootFinding:-Isolate([a^2+b^2-(1/3)*c^2, a^2-(1/3)*b^2+c^2-1/3, a^2-(1/3)*b^2-(1/3)*c^2+1], [a,b,c]):

catch "":
next:
end try:
end do:
end proc:

I don't have much experience with threads in Maple, so I would not be surprised if I'm trying to do something completely wrong here. Any hint or advice will be appreciated.
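A possible workaround (my suggestion, not from the post): RootFinding:-Isolate is not documented as thread-safe, so distributing the polynomial systems over separate kernels with the Grid package avoids shared state entirely, at the cost of inter-process overhead.

```
# Sketch: each polynomial system is solved in its own Grid node
# (a separate kernel process), so no library state is shared.
sys:= [a^2+b^2-(1/3)*c^2, a^2-(1/3)*b^2+c^2-1/3,
       a^2-(1/3)*b^2-(1/3)*c^2+1]:
systems:= [sys $ 4]:                  # stand-in for the many systems to treat
results:= Grid:-Map(S-> RootFinding:-Isolate(S, [a, b, c]), systems):
```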

## Matrix and parallel computing / multicore...

Hi,

Is there any difference between

Matrix(4,5,(K,C)->K+C);

and

Array(1..4,1..5,(K,C)->K+C);

Say I have a very 'complicated' procedure myfunc(K,C) that takes two arguments, where each call runs independently of the others.

Matrix(4,5,(K,C)->myfunc(K,C));

and

Array(1..4,1..5,(K,C)->myfunc(K,C));

Which one is more efficient? The final output of each run of myfunc is just an integer value.

The reason I am asking is that I think both run on a single thread (core), since CPU usage is always very low, around 15-20%.

If I look at the task manager, some cores (threads) aren't doing anything.

Is there any way to speed things up?

Thanks,
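If myfunc is thread-safe, one sketch worth trying (mine, not from the post; the Matrix and Array constructors themselves fill entries on a single thread, which matches the low CPU usage observed) is to generate the entries with Threads:-Seq and assemble the Matrix afterwards:

```
# Sketch: compute the 4x5 entries in parallel, then build the Matrix.
# Assumes myfunc is thread-safe; that needs to be checked first.
rows:= [Threads:-Seq([seq(myfunc(K, C), C= 1..5)], K= 1..4)]:
M:= Matrix(rows):
```

The outer Threads:-Seq splits the rows across threads; the inner plain seq keeps each row's work on one thread, which keeps the scheduling overhead small.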

Hello, everyone. I have a problem with a multithreaded calculation. I need to calculate the eigenvalues of a matrix m at various parameters (and then export them to a file), taking advantage of parallelization. The following code works, but only serially.

restart: with(LinearAlgebra):

m:=ImportMatrix(cat(currentdir(),"m.txt")): # here is matrix m.txt

step:=1:

prc:=proc(k1,k2)

local u,i,j,nmc:

u:=Matrix(ceil(1+(k2-k1)/step),5):

u[1,1]:=k1:

for i from 2 to op([1,1],u) do

u[i,1]:=u[i-1,1]+step:

end do:

for i from 1 to op([1,1],u) do

nmc:=sort(Eigenvalues(m*u[i,1], output='list')):

for j from 2 to op([1,2],u) do

u[i,j]:=nmc[j-1]:

end do:

writedata[APPEND](cat(currentdir(),"u_",convert(k2,string),".txt"), [convert(Re(u[i]),list)]):

print(u[i,1]);

end do:

return finished:

end proc:

quit:

The Start(ArrayTools[Concatenate], 2, Task=[prc,1,20], Task=[prc,20+step,40]) function creates two tasks of calculation over the parameter ranges 1-20 and 21-40. But in this case Start takes twice as much time as simply calling prc(20+step, 40). How can I realize a truly multithreaded calculation?

By the way, I don't actually need the Concatenate function in Start, but Start doesn't work without some procedure there.
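A possible alternative (my sketch, not from the post): LinearAlgebra:-Eigenvalues is not documented as thread-safe, which could explain why the two Tasks serialize. The two parameter ranges may parallelize more reliably as separate kernel processes via the Grid package:

```
# Sketch: run the two parameter ranges on two Grid nodes (separate
# processes), using the prc procedure defined above.
Grid:-Set('prc', 'm', 'step'):        # make the procedure and data
                                      # visible on every node
Grid:-Run(0, prc, [1, 20]):
Grid:-Run(1, prc, [20+step, 40]):
Grid:-Wait(0, 1):                     # block until both nodes finish
```

Since prc already writes its results to a file per range, no return values need to be collected from the nodes.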

## Is it really better? with(Threads)...

Hi all,

I just learned about the with(Threads) package and I wanted to do a bit of testing to see if it's 'faster' or more 'efficient'.

Test.pdf

This was done on a i7 940X (QuardCore) laptop.

I don't really see how it is better in any sense.

Or am I using Seq for the wrong purpose?

Casper

## Parallel in Loop...

Hi,

I want to parallelize the following code. Ideally, part 1 would be solved by one thread and part 2 by another thread.

dtheta[m](x):= theta[m-1](x)+phi[m-1](x);
dphi[m](x):= theta[m-1](x)-2*phi[m-1](x);

for i from 2 by 1 to 10 do;

# part1

theta[i](x):=subs(m=i,dtheta[m](x)):
theta[i](x):=value(%):
eval(-theta[i](x),x=0);
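The two updates in each pass can be sketched with the Task model. This is my simplified sketch of the recurrences themselves, not the poster's code; t0 and p0 are hypothetical initial values.

```
# Sketch: evaluate the two recurrences of one iteration as two tasks.
pair:= (a, b)-> [a, b]:               # continuation: collect both results
theta[1]:= t0:  phi[1]:= p0:          # hypothetical initial values
for m from 2 to 10 do
    (theta[m], phi[m]):= Threads:-Task:-Start(pair,
        Task= [()-> theta[m-1] + phi[m-1]],       # part 1
        Task= [()-> theta[m-1] - 2*phi[m-1]])[]   # part 2
end do:
```

Start blocks until both tasks finish, so each iteration sees the completed values from the previous one.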

## parallel code with maple...

Hi, I want to write the following algorithm in Maple with two threads. Please help me.

for m from 1 by 1 to 100 do
    if thread1 then result1:= some equation end if;
    if thread2 then result2:= some equation end if;
end do:
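A sketch of that loop with Threads:-Create and Threads:-Wait (my sketch; eq1 and eq2 are stand-ins for the two equations):

```
# Sketch: each pass launches the two computations on two threads and
# waits for both before starting the next pass. eq1/eq2 are hypothetical.
with(Threads):
for m to 100 do
    id1:= Create(eq1(m), result1);
    id2:= Create(eq2(m), result2);
    Wait(id1, id2);      # both result1 and result2 are now available
end do:
```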

Hi everyone,

How can I split loops across threads, the way the OpenMP library does?

#pragma omp for
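The closest Maple analogue of `#pragma omp for` (my comparison, not from the post) is Threads:-Seq, which splits the index range over the available threads automatically:

```
# Sketch: a work-shared parallel loop that collects f(i) for i = 1..100,
# analogous to an OpenMP "omp for". f is a hypothetical stand-in.
results:= [Threads:-Seq(f(i), i= 1..100)]:
```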

## Sleep Sort

by: Maple 15

Sleep Sort is a hilarious (to me anyway) joke dressed up as a sorting algorithm.

Here it is in non-obfuscated (if somewhat garbagey) Maple code (need version 15 since it uses ?Threads,Sleep )

```
SleepSort := proc(L::list(posint), $)
local Lout, p, i;
Lout := NULL;

p := proc(n::posint, $)