PROGRAM OPTIMIZE
Andrew Abbott
Introduction to the OPTIMIZE AlgorithmOPTIMIZE is a simple matching algorithm. It calculates a distance between
two sequences in terms of the insertions, deletions, and substitutions
required to turn one sequence into another. The calculation is done by
the standard procedure usually attributed to Needleman and Wunsch, but
actually discovered by quite a number of people simultaneously in the early
1970s.
- The two sequences are arrayed in a three way array D
D has dimension len(1)+1 x len(2)+1 x 4. where len(1) is length of the
first sequence and len(2) is length of the second sequence
- There is a fixed cost of insertions and deletions (indels) This is
called "indel cost." This is uniform throughout the matrix but
may be scaled, relative to the substitution cost.
- There is a variable cost of substitutions of one element for another.
This is called the "substitution cost." It can be found in one
of two ways. When the sequence elements are categorical variables, it will
be found in a lookup table. If the sequence elements are continuous variables,
it can calculated as a Euclidean distance. Other metrics are possible but
not yet implemented in this program. In many biological applications, there
is only one substitution cost, a "penalty" for mismatch.
- For i = 1 to 4,
- D(0,0,i) = 0
- For i = 1 to len(1)+1
- D(0,i,1) = 0
D(0,i,3) = 0
D(0,i,2) = "indel cost" - For j = 1 to len(2)+1
- D(0,i,2) = 0
D(0,i,3) = 0
D(i,0,1) = "indel cost" - For i>0 and j>0
- D(i,j,1) = D(i,j,2) = "indel cost"
D(i,j,3) = "substitution cost"
- The algorithm proceeds recursively, by columns or rows (it doesn't
matter) That is one calculates (by rows) D(0,1,4), then D(0,2,4) to D(0,len(1)+1,4)
then D(1,1,4), D(1,2,4) etc. until one reaches D(len(2)+1, len(1)+1, 4)
which is the final result
At each stage, we calculate D(i,j,4) by the following minimization:
D(i-1,j,4) + D(i,j,1)
(total cost to get to the cell above the present cell plus cost of entering
the present cell by inserting in the sequence spanning the rows [i.e. "deleting"
in the column sequence].)
D(i,j,4) = MIN D(i,j-1,4) + D(i,j,2)
(total cost to get to the cell to the left of the present cell plus cost
of entering the present cell by inserting in the sequence spanning the
columns [i.e., "deleting" in the row sequence.)
D(i-1,j-1,4) + D(i,j,3)
(total cost to get to the cell above and to the left plus cost of entering
the present cell by substituting the row element for the column element)
- Notes on other algorithms.
It is by changing this inner minimization or making it conditional on
other things that different algorithms are built. To make the algorithm
"pay less attention" if an inserted state is identical with what
"preceded it," we simply make the first two of these possibilities
conditional on the nature of the preceding state. This produces an algorithm
where "gaps" in a sequence are priced at some affine function
a + bx, where a is the initial cost, and x is the (lower) cost of simply
adding on another similar state. This retains some sensitivity to duration
without retaining strong sensitivity to "simply continuing what went
before." The current implementation is a special case, with a = 0.
Subsequence algorithms generally work by changing all of these costs
(see the workbook.) The multiple subsequence problem is far more challenging
than multiple overall sequence distances, however. Current methods are
somewhat ad hoc. A bunch of possible existing software is listed above,
but most biological applications are for few sequences with huge lengths.
They also tend to have funky front ends, since biologists have their own
IO issues.
|