:

## The Needleman-Wunsch algorithm for biological sequence alignment, in Maple

Maple

Recently I was skimming on the Internet for algorithms for alignment of biological sequences (e.g. DNA sequences, protein sequences). The usual purpose of such comparisons is to determine the evolutionary genetic history of two sequences, which in the case of DNA you can think of as strings over the alphabet {A,G,C,T} (the nucleobases). Differences between sequences can arise though substitutions (A → G, G → A, C → T, or T → C), insertions, or deletions. In the last two cases, the resulting string differs in length from the original, and dealing with these size differences is the chief problem that alignment algorithms face. The Needleman-Wunsch algorithm is a very straightforward global sequence alignment algorithm, first proposed in 1970. It's a good example of the applicability of dynamic programming towards biological problems. Following is a Maple implementation based on this Ruby implementation. The algorithm depends on a similarity matrix which measures the "mutational distance" between two nucleobases. It takes two input strings, and returns two strings with dashes ("-") inserted to indicate insertion or deletion events between the sequences.

I have not attempted any optimisations at all and it's quite slow, though I have read that the runtime can be brought to O(m+n) from O(mn), and I imagine that Maple's `rtable_scanblock` could be used to substantially reduce its runtime. Any suggestions or revisions are welcome.

Notice that `NeedlemanWunsch` is a module; this is so that the similarity matrix `SM`does not have to be assigned with each procedure invocation.

`NeedlemanWunsch := module()`    local ModuleApply, SM, gap;
# similarity matrix
#     A  G  C  T
#  A 10 -1 -3 -4
#  G -1  7 -5 -3
#  C -3 -5  9  0
#  T -4 -3  0  8
SM := table([
("A","A") = 10, ("A","G") = -1, ("A","C") = -3, ("A","T") = -4,
("G","A") = -1, ("G","G") =  7, ("G","C") = -5, ("G","T") = -3,
("C","A") = -3, ("C","G") = -5, ("C","C") =  9, ("C","T") =  0,
("T","A") = -4, ("T","G") = -3, ("T","C") =  0, ("T","T") =  8
]):
gap := -5;

ModuleApply := proc(sequence::string, reference::string)
local rows, cols,
a, i, j,
choice1, choice2, choice3,
ref, seqq,
score, score_diag, score_up, score_left;
rows := length(reference);
cols := length(sequence);
a := Array(0..rows,0..cols);
for i from 1 to rows do
for j from 1 to cols do
choice1 := a[i-1, j-1] + SM[reference[i], sequence[j]];
choice2 := a[i-1, j] + gap;
choice3 := a[i, j-1] + gap;
a[i,j]  := max(choice1, choice2, choice3);
end do;
end do;
ref  := StringTools:-StringBuffer();
seqq := StringTools:-StringBuffer();
i := length(reference);
j := length(sequence);
while (i>0) and (j>0) do
score := a[i,j];
score_diag := a[i-1, j-1];
score_up   := a[i, j-1];
score_left := a[i-1, j];
if (score = score_diag + SM[reference[i], sequence[j]]) then
ref:-append( reference[i] );
seqq:-append( sequence[j] );
i := i-1;
j := j-1;
elif (score = score_left + gap) then
ref:-append( reference[i] );
seqq:-append( "-" );
i := i-1;
elif (score = score_up + gap) then
ref:-append( "-" );
seqq:-append( sequence[j] );
j := j-1;
end if;
end do;
while (i > 0) do
ref:-append( reference[i] );
seqq:-append( "-" );
i := i-1;
end do;
while (j > 0) do
ref:-append( "-" );
seqq:-append( sequence[j] );
j := j-1;
end do;
map( StringTools:-Reverse, [seqq:-value(), ref:-value()] )
end proc:
end module:

Example:
> NeedlemanWunsch("AGCTAGCT", "AGCAGCT");

["AGCTAGCT", "AGC-AGCT"] ﻿