Parallelizing code for solving matrix equation AX=B (over a finite field with A upper triangular) by splitting B into one piece for each CPU.restart:The float[8] technique used in this worksheet is extremely fast and is limited to moduli less than 2^25. If you need a larger modulus, it is very easy to modify (let me know), and just a little slower.N:= 2^7: #order of matrix A
n:= 2^14: #number of vectors b[i] (number of columns of B)
p:= 127: #prime modulusisprime(p);Generate a random example upper triangular nonsingular A, convert to shape= rectangular (required by LinearAlgebra:-Modular), and input to Modular. A:= LinearAlgebra:-RandomMatrix(
N, N,
shape= triangular[upper, unit],
datatype= float[8]
):A:= Matrix(A, shape= rectangular):A:= LinearAlgebra:-Modular:-Mod(p, A, float[8], C_order):Modular works much faster (two to three times faster) in C_order rather than Fortran_order.Generate a list of random side vectors b[k].b:= [seq](
LinearAlgebra:-RandomVector(
N, datatype= float[8]
), k= 1..n
):We assume that all the side vectors are stored in a single Matrix to start with.B:= `<|>`(b[]):B:= CodeTools:-Usage(
LinearAlgebra:-Modular:-Mod(p, B, float[8], C_order)
):This solving procedure is used for all solutions.Solve:= proc(B::Matrix, A::Matrix, p::posint)
#Command below operates on B inplace.
LinearAlgebra:-Modular:-BackwardSubstitute(p, A, B);
B
end proc:Get the number of CPUs so that we can compute the size of the pieces into which to divvy up the side vectors such that there is one piece per CPU.NC:= kernelopts(numcpus);The method is to divide B into NC equal-sized pieces (or as close to equal as possible). First, I will use the method in sequential (i.e., not parallel) code, just for comparison purposes.Xs:= CodeTools:-Usage(
[seq](
Solve(
B[.., trunc((k-1)*n/NC)+1..trunc(k*n/NC)],
A, p
), k= 1..NC
)
):To make the above run in parallel, only one change is needed: Change seq to Threads:-Seq[tasksize= 1].Xs:= CodeTools:-Usage(
[Threads:-Seq[tasksize= 1]](
Solve(
B[.., trunc((k-1)*n/NC)+1..trunc(k*n/NC)],
A, p
), k= 1..NC
)
):The above solution is returned as a list of NC matrices. If we put them together into a single matrix, then the time required eliminates any benefit obtained by parallelizing. However, having the matrices already divided into NC pieces might be useful in parallelizing the next step, whatever that is.X:= CodeTools:-Usage(`<|>`(Xs[])):Verify results (not really necessary). Verify A.X - B = 0. ArrayTools:-IsZero(
LinearAlgebra:-Modular:-AddMultiple(
p,
-1 mod p,
LinearAlgebra:-Modular:-Multiply(p, A, `<|>`(Xs[])),
LinearAlgebra:-Modular:-Mod(p, B, float[8])
)
);Plain solve of the whole matrix at once. This is the time to beat. (Note the the command below overwrites B.)X:= CodeTools:-Usage(Solve(B, A, p)):So there is some benefit to parallelising if we accept having the solution divided into several matrices.