I use threads directly, with primitives for memory barriers, compare and swap, and short sleeps. We went for maximum parallel performance and left nothing to chance. We split up the polynomials manually and multiply with all cores running at full tilt. The threads cooperate to combine the results, and use a load-balancing scheme with no communication overhead. The whole thing runs in the CPU cache. It's a true parallel algorithm designed for many-core processors. It is currently compiled for multicore cpus, which use fewer threads due to higher scheduling latency and faster processing versus communication.
You probably couldn't port these algorithms to a GPU. They're sparse algorithms which inherently involve a lot of branching, like heapsort. GPUs expect fairly uniform computations over independent sets of data, so it just isn't a match. GPUs would be great for dense polynomial arithmetic though.

Thanks for your kind comment. The source is not available, and probably won't be anytime soon. It's still under heavy development, and every year I take an axe to it to simplify the code and do something new. There is no programmer interface exposed in Maple. Instead, we have tried to describe all the details in the papers. I'm happy to answer further questions or provide tips to anyone working on this.
In Maple 14 we hooked the routines into expand and divide, but we did a bit more than that. We lowered the thresholds for the kernel to call out to `expand/bigprod` and `expand/bigdiv` where our code is used. And we wrote code for `expand/bigprod` so that polynomials with rational coefficients, functions such as sin(x), and even rational function coefficients would run our code. These extra steps were critical to maximize the performance, and you won't get nearly the same improvement (e.g. in factor) from simply hooking our code into Maple 13. It's a finished product.
Here's another example from the

Trip paper to illustrate. I can't supply times since the beta license expired and we're waiting for the official release to be installed in our lab, but the improvement from Maple 13 to 14 is massive, and Maple 14 is the fastest system out there that I know of.

f := expand((1/3*x + 3/5*y + 5/7*z + 7/11*t)^20): g := f+1:
gc(): TIMER := time[real](): p := expand(f*g): time[real]()-TIMER;
gc(): TIMER := time[real](): divide(p,f,'q'): time[real]()-TIMER;

Are we going to have another round of importing posts after the new site launches?

Multiplying by a constant is there for linear algebra. It's a fairly basic thing to scale a vector, and it's used a lot in algorithms. I guess nobody thought that translation was similarly important.

They're internal functions which you should avoid unless you need them or want to do things directly.

They're internal functions which you should avoid unless you need them or want to do things directly.

Then it's a much easier problem:
alias(a=RootOf(1+z^31+z^23+z^11+z^7+z^2+z^32));
Gcdex(x^32,x^32+(a^15+a^11+a^6)*x^31+(a^13+a^10+a^3)*x^23+(a^11+a^7+a^2+1)*x^11+x^7+a^13+a^10+a^3+a^1+1,x,'s','t') mod 103;

Then it's a much easier problem:
alias(a=RootOf(1+z^31+z^23+z^11+z^7+z^2+z^32));
Gcdex(x^32,x^32+(a^15+a^11+a^6)*x^31+(a^13+a^10+a^3)*x^23+(a^11+a^7+a^2+1)*x^11+x^7+a^13+a^10+a^3+a^1+1,x,'s','t') mod 103;

Your matrices have symbolic entries. That's very different from software that uses sparse structured algorithms on floating point matrices, because you are going to get large polynomials. Maple can solve 100000 x 100000 sparse structured linear systems too. Wanna see? Download and save: http://www.cecm.sfu.ca/~rpearcea/sge/largesys
read "largesys": # load it in
infolevel[solve] := 5:
sol := SolveTools:-Linear(eqns,vars):
That system is fairly easy, harder systems will take a few minutes. Eigenvalue problems are also often harder than solving.

Your matrices have symbolic entries. That's very different from software that uses sparse structured algorithms on floating point matrices, because you are going to get large polynomials. Maple can solve 100000 x 100000 sparse structured linear systems too. Wanna see? Download and save: http://www.cecm.sfu.ca/~rpearcea/sge/largesys
read "largesys": # load it in
infolevel[solve] := 5:
sol := SolveTools:-Linear(eqns,vars):
That system is fairly easy, harder systems will take a few minutes. Eigenvalue problems are also often harder than solving.

f := a00*x^123+a45*x^233+a02*x^123+a67*x^156+a47*x^67;
C := [coeffs(f,x,'M')]: # coefficients
M := [M]: # monomials
S := [seq([C[i], M[i], degree(M[i])], i=1..nops(C))]:
S := sort(S, (a,b)->evalb(a[3] < b[3]));

f := a00*x^123+a45*x^233+a02*x^123+a67*x^156+a47*x^67;
C := [coeffs(f,x,'M')]: # coefficients
M := [M]: # monomials
S := [seq([C[i], M[i], degree(M[i])], i=1..nops(C))]:
S := sort(S, (a,b)->evalb(a[3] < b[3]));

It assigned the coefficients of f to C, so C[1] is the coefficient for the monomial M[1], etc.

It assigned the coefficients of f to C, so C[1] is the coefficient for the monomial M[1], etc.

That's the maximum number of digits (base 10), which is about a 100MB integer. Not bad for a 32-bit machine. On a 64-bit machine the limit is 15GB.