marvin

MaplePrimes Activity


These are replies submitted by marvin

@pagan Sorry, I am clearly not being clear.  I could be having trouble with the MKL libraries if I were doing matrix-matrix multiplication on float[8] matrices, but in testing Maple I wasn't doing that.  In fact, I did similar kinds of problems to construct a matrix, but nothing that called the MKL library; moreover, many of the examples did only integer operations.  Nevertheless, I never saw a useful speedup for large examples.  MKL and OMP_NUM_THREADS are a red herring; I shouldn't have gone there in my remarks.  The fact is that for my real-world kind of problems, running 4 threads never did better than 1 thread.  Roman Pierce promised to explain it all to me and show me how to do it correctly, but he never did.  All that happened is that he came up with more examples that didn't work.

Hi,

Actually I regularly use the MKL libraries to speed things up by setting OMP_NUM_THREADS to 3 or 4 on a 4-core machine.  In the tests I am speaking about I set it to 1.  I am well aware of the fixes you suggest, and if matrix-matrix multiplication were the real problem I am interested in, I would do what you suggest.  Unfortunately I have a slightly different problem: I have to construct a large matrix using compiled Maple code.  That gets me into the problem of getting Maple to let me parallelize that code, since it assumes the code isn't thread safe.  Eventually I will have to externally compile my code and then link it to Maple, but frankly, for what I am doing now, that is a pain in the....  So, first I want to find out whether there is any win to be had from parallel tasks in Maple.  So far as I can tell, there is no win at all.  I fear that the problem has to do with Maple's overhead and garbage collection, not with my example.

 

I have already posted to your blog that there is little to no speedup for what I would have thought was a very parallelizable problem: multiplying a list of vectors by a matrix.  This is true both on a 64-bit quad-core machine running XP Professional and on a quad-core machine running Windows 7; both machines have 8 GB of RAM.  I really haven't gotten a satisfactory answer as to why this doesn't work.  It obviously works great on a CUDA card, which has, of course, more processors, but which handles many, many threads.
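For concreteness, here is the sort of experiment I mean (a sketch; the sizes are just placeholders).  In principle Threads:-Map should distribute the independent multiplications across cores:

```maple
with(LinearAlgebra):
M := RandomMatrix(500, 500, outputoptions = [datatype = float[8]]):
vecs := [seq(RandomVector(500, outputoptions = [datatype = float[8]]),
             i = 1 .. 100)]:

# single-threaded baseline
st := time[real]():  r1 := map(v -> M . v, vecs):  time[real]() - st;

# parallel version: Threads:-Map hands each multiplication to a task
st := time[real]():  r2 := Threads:-Map(v -> M . v, vecs):  time[real]() - st;
```

The puzzle is that the second timing is no better than the first.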

I would really love to have a better understanding of these issues.

 

Marvin Weinstein
SLAC Linear Accelerator Center

By the way, how come there is no documentation for Threads:-Task:-Return?

I would love to see more details on External Calling.  What I really want is to be able to write programs that execute on a GPU and then call them from Maple, with or without having them called in parallel.

I have been playing a lot with the multi-threaded stuff in the Maple 14 beta, mostly dealing with matrices; Darren already has some of the examples.  The problem I see is no real speedup no matter what I try.  Or else Maple uses up tons of memory and crashes.  Darren has remarked that the problem might be with Maple's garbage collector, and I assume he knows what he is talking about.  Since he is looking for topics, I would love to understand this issue better (I am a physicist, not a computer scientist) in order to know whether I should just stop trying to use this stuff, or whether it is worth continuing to poke at it.  The little bit of CUDA support that we will have is, of course, a much bigger win.  I would love a lot more linear algebra support.

If I get to vote, my preference would be for Maple to fix the problems with the multi-threaded code and to get good support for compiled CUDA (or whatever) programs for those of us with GPUs.  Since a lot of what I do is double-precision numerical linear algebra, being able to offload matrix multiplies and the construction of big matrices to the GPU would be a **big** win.  Obviously, being compiler- and linker-averse (not incapable -- just averse), being able to compile Maple code to CUDA code (even if I have to be careful about how I write the code, as I do now with Compiler:-Compile()) and then run it from Maple (using define_external) would be heaven.
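For reference, the Compiler:-Compile() workflow I mean is along these lines (a minimal sketch; the proc is a toy, and the usual restrictions to the compilable numeric subset apply):

```maple
# a numeric proc in the compilable subset: typed arguments,
# scalar locals, simple loops
f := proc(n::integer, x::float)::float;
    local k::integer, s::float;
    s := 0.0;
    for k to n do
        s := s + x^k / k;
    end do;
    return s;
end proc:

cf := Compiler:-Compile(f):   # translate to native code
cf(10, 0.5);                  # called like any Maple procedure
```

What I am hoping for is the same workflow with the native code landing on the GPU instead of the CPU.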

I know having a wonderful general solution for adding high-level language support would be great; however, getting this basic level of support into the next beta would solve a lot of my problems right away.

Actually I know that; I just didn't bother making any changes.  The increase is small.

Are you sure it is the recursive task creation?  It could just be that the small tasks all refer to the same chunk of memory but don't allocate any big chunks of their own; they just return a single number.  This is a lot more like the CUDA code that I was trying to emulate.  I would really love to know what the real cause of the slowdown is.
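To make sure we mean the same thing by "recursive task creation", here is that style as I understand it, as opposed to the flat seq of Task=[...] entries I used (a sketch modeled on the usual divide-and-conquer Task idiom; the cutoff of 8 is arbitrary):

```maple
# each task either sums a small block directly or splits in two;
# `+` is the continuation that recombines the two subtask results
psum := proc(lo, hi, work)
    local mid;
    if hi - lo < 8 then
        return add(work(k), k = lo .. hi);
    end if;
    mid := iquo(lo + hi, 2);
    Threads:-Task:-Continue(`+`,
        Task = [psum, lo, mid, work],
        Task = [psum, mid + 1, hi, work]);
end proc:

Threads:-Task:-Start(psum, 1, 1000, k -> evalf(k^2));
```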

Also, getting rid of the bug would be really nice ;^)

If you pass MakeV0A a matrix M that is all zeros and change the last lines to

M[i,j]:=V0;

end proc;

then run the program, it does exactly the right thing the first time.  Follow that with garbage collection, redefine M to be all zeros again, and run the same set of tasks, and Maple crashes: it loses the connection with the kernel.  What's up with that?

So the following code is running on my laptop, which has only 2 cores.  I will try it at home later with 4 cores.  The speedup improves with size; however, I run out of memory very quickly when the rows and columns get up to 100.  That seems to be the garbage collection.

restart;


N:=0.05*RandomMatrix(300,300):
N:=0.5*(N+HermitianTranspose(N)):
XX:=0.03*RandomMatrix(3000,18):

MakeV0A:=proc(i,j,Nij,X,Xcdim,Xrfull)
    local V0,den,k,l,pot,psi,tempdiffs,Xpre;

    den:=(2);
    Xpre:=Vector(Xcdim,datatype=float[8]);
    tempdiffs:=Vector(Xrfull,datatype=float[8]);

    seq(assign('Xpre'[k], (X[i,k] + X[j,k])/2.0), k=1..Xcdim ) ;
    seq(assign('tempdiffs'[l], add( (Xpre[k]-X[l,k])^2, k=1..Xcdim )/den), l=1..Xrfull) ;
      pot:=add( k*(exp(-k)), k in tempdiffs );
    psi:=add( exp(-k), k in tempdiffs );

    if  pot=0.0 and psi=0.0 then
        V0:=0.0;
    else
        V0 := Nij*pot/psi;
    end if;

    return [i,j,V0];
end proc:
MakeV0B:=proc(i,j,Nij,X,Xcdim,Xrfull)
    local V0,den,k,l,pot,psi,tempdiffs,Xpre;

    den:=(2);
    Xpre:=Vector(Xcdim,datatype=float[8]);
    tempdiffs:=Vector(Xrfull,datatype=float[8]);

    seq(assign('Xpre'[k], (X[i,k] + X[j,k])/2.0), k=1..Xcdim ) ;
    seq(assign('tempdiffs'[l], add( (Xpre[k]-X[l,k])^2, k=1..Xcdim )/den), l=1..Xrfull) ;
      pot:=add( k*(exp(-k)), k in tempdiffs );
    psi:=add( exp(-k), k in tempdiffs );

    if  pot=0.0 and psi=0.0 then
        V0:=0.0;
    else
        V0 := Nij*pot/psi;
    end if;

    return V0;
end proc:


fillM:=proc(a)
return a,_rest;
end proc;
 

foo:=seq(seq(Task=[MakeV0A,i,j,N[i,j],XX,18,10],j=1..70),i=1..70):

st:=time[real]():Matrix(70,70,(i,j)->MakeV0B(i,j,N[i,j],XX,18,10),datatype=float[8]);time[real]()-st;
                           Matrix(%id = 959381748)
                                   53.941
task:=proc(N,XX)
local W;
W:=Task:-Continue(fillM,foo);
return W;
end proc:
gc();
st:=time[real]():Vtask:=[Task:-Start(task,N,XX)]:time[real]()-st;
                                   46.191
gc();nops(Vtask);
                                    3600

 

 

 

You are correct that the original code had a bug, but the new code fixes it.  Both the tasks and the single-threaded example construct exactly the same matrix now.

As to your comments:

First: Thanks for the speedup suggestions; I am always interested in learning from an expert, although it isn't all that important in this case, since eventually this has to be in compiled code.

Second:  Are you saying that if I broke the process up into smaller chunks it would run faster?  I understand how that would work in the CUDA case, since each computation could live in local memory, but why would it matter in this case?  I would love an explanation of how memory usage impacts parallelization.

Once again, thanks for the response

Marvin

 

 

I ran it on my machine at home: quad-core, 8 GB of RAM, 64-bit machine running Windows 7.

181 sec single-threaded.

345 sec running 3 threads.

Consistently bad performance.  This was the only program using the CPU.  It could be that the Intel math libraries are contending with the big threads, but I would be surprised if that turned a factor of 4 faster into a factor of 2 slower.

Marvin

 

 

So I have constructed this example.  Naively I would have thought that, since constructing the big matrix takes N^2 operations and each small matrix takes (N/2)^2, running three tasks at once would give a big speedup.  Instead it is much slower.

My machines are a quad-core x64 (Windows XP) and a 2-core x86 (Windows 7).  The overall result is similar on both.  What is happening here?

> restart;

> with(LinearAlgebra);with(Threads);
> Digits:=14:
> UseHardwareFloats:=true:
> # some utility programs
>
> Ncols:=proc(A)
> op(2,ArrayDims(A)[2]);
> end proc:
>
> Nrows:=proc(A)
> op(2,ArrayDims(A)[1]);
> end proc:

> kernelopts(numcpus);
                                      2
> kernelopts(numactivethreads);
                                      1
> MakeV0:=proc(nrows::integer,ncols::integer,istart::integer,jstart::integer,N::Matrix(datatype=float[8]), X::Matrix(datatype=float[8]),nPotbasis::integer, sigma::float)
> local foo,Xrdim,Xcdim,Xrfull,ioffset,joffset;
> global MakeV0C;
>
> Xrdim:=nrows;
> Xcdim:=ColumnDimension(X);
> Xrfull :=nPotbasis;
>
> ioffset:=istart;
> joffset:=jstart;
>
> print("numthreads executing = ",kernelopts(numactivethreads));

> if istart=jstart then
>    foo:=Matrix(nrows,ncols,(i,j)->MakeV0A(i+ioffset,j+joffset,N[i+ioffset,j+joffset],X,sigma,Xcdim,Xrfull));
> else
>    foo:=Matrix(nrows,ncols,(i,j)->MakeV0A(i+ioffset,j+joffset,N[i+ioffset,j+joffset],X,sigma,Xcdim,Xrfull));
> end if;
> return foo;
> end proc:
>
> MakeV0A:=proc(i,j,Nij,X,sigma,Xcdim,Xrfull)
> local V0,den,k,l,diffs,pot,psi,tempdiffs,temppsi,temppot,potn,chsign;
> #option autocompile;
>   
>    den:=(2*sigma^2);
>    chsign:=proc(x) -x;end proc;
>    tempdiffs:=Array(1..Xrfull);

>    seq(assign('tempdiffs'[l],add(((X[i,k]+X[j,k])/2.0-X[l,k])^2,k=1..Xcdim)/den),l=1..Xrfull);
>    temppsi:=exp~(chsign~(tempdiffs));
>    temppot:= zip((a,b)->a*b,temppsi,tempdiffs);
>    pot:=convert(temppot,`+`);
>    psi:=convert(temppsi,`+`);
>   
>    if  pot=0.0 and psi=0.0 then
>         V0:=0.0;
>    else   V0 := Nij*pot/psi;
>    end if;
> return V0;
> end proc:

> mergeV:=proc(V1,V2,V3)
> local V1rows, V2rows,V3rows,V1cols,V2cols,V3cols,totrows,totcols,Vtemp;
>
> V1rows:=Nrows(V1);
> V1cols:=Ncols(V1);
> V2rows:=Nrows(V2);
> V2cols:=Ncols(V2);
> V3rows:=Nrows(V3);
> V3cols:=Ncols(V3);
>
> if V3rows <> V1rows then
>     error("V3rows=%1 not equal to V1rows = %2",V3rows,V1rows);
> elif V3cols <> V2cols then
>     error("V3cols=%1 not equal to V2cols = %2",V3cols,V2cols);
> end if;
>
> totrows:=V1rows+V2rows;
> totcols:=V1cols+V2cols;
>
> Vtemp:=Matrix(V1rows+V2rows,V1cols+V2cols,datatype=float[8]);
> Vtemp[1..V1rows,1..V1cols]:=V1;
> Vtemp[V1rows+1..totrows,V1cols+1..totcols]:=V2;
> Vtemp[1..V1rows,V1cols+1..totcols]:=V3;
> Vtemp[V1rows+1..totrows,1..V1cols]:=HermitianTranspose(V3);
>
> return Vtemp;
>
> end proc:
>
> N:=0.05*RandomMatrix(300,300):
> N:=0.5*(N+HermitianTranspose(N)):
> XX:=0.03*RandomMatrix(3000,18):
> st:=time():Vfull:=MakeV0(40,40,1,1,N,XX,400,0.2);time()-st;
                         "numthreads executing = ",1
                           Matrix(%id = 869864272)
                                   208.938
> task:=proc(N,XX)
> local W;
> W:=Task:-Continue(mergeV,Task=[MakeV0,20,20,1,1,N,XX,400,0.2],
> Task=[MakeV0,20,20,21,21,N,XX,400,0.2],Task=[MakeV0,20,20,1,21,N,XX,400,0.2]);
> return W;
> end proc:
> st:=time():Vtask:=Task:-Start(task,N,XX);time()-st;
                         "numthreads executing = ",4
                         "numthreads executing = ",4
                         "numthreads executing = ",4
                           Matrix(%id = 869867552)
                                   314.734
> Vfull-Vtask;
 

 

Hi Darren,

I am starting to look at the CUDA stuff now that I have an NVidia Quadro card.  Eventually I want to be able to make a library of routines that I can call with define_external().  Is there a problem doing this?  Will you give us an example of how to do this in Windows?
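Just so we are talking about the same thing, here is the shape of the call I have in mind.  Everything below (the DLL name gpu.dll and the exported function cuda_mxm) is hypothetical; producing such a DLL with the CUDA toolchain on Windows is exactly the part I am asking about:

```maple
# hypothetical: suppose gpu.dll exports
#   void cuda_mxm(double *A, double *B, double *C, int n)
# which computes C := A.B for n x n matrices on the GPU
cuda_mxm := define_external('cuda_mxm',
    'A'::ARRAY(datatype = float[8]),
    'B'::ARRAY(datatype = float[8]),
    'C'::ARRAY(datatype = float[8]),
    'n'::integer[4],
    'LIB' = "gpu.dll"):

A := Matrix(4, 4, (i, j) -> evalf(i + j), datatype = float[8]):
B := Matrix(4, 4, (i, j) -> evalf(i*j), datatype = float[8]):
C := Matrix(4, 4, datatype = float[8]):
cuda_mxm(A, B, C, 4):   # C now holds the product
```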

Marvin

Actually I was talking to a VP of NVidia on Monday about the hit (which is a factor of 8 for my card).  He said that it is all a question of how they apportion the real estate on the card as all processors are already 64-bit.  Furthermore he said that the new Fermi board will only take a hit of a factor of 2.  Now that is exciting.  So, while my new home machine is already a done deal, I am clearly going to wait for my work machine.  A factor of 4 is nothing to sneeze at ;^)

 

By the way, what do you think of Go, Google's new language?
