optimal matching back to optimal matching main page

PROGRAM OPTIMIZE

Andrew Abbott

Introduction to the OPTIMIZE Algorithm

OPTIMIZE is a simple matching algorithm. It calculates a distance between two sequences in terms of the insertions, deletions, and substitutions required to turn one sequence into another. The calculation is done by the standard procedure usually attributed to Needleman and Wunsch, but actually discovered by quite a number of people simultaneously in the early 1970s.

  • The two sequences are arrayed in a three way array D
    D has dimension len(1)+1 x len(2)+1 x 4. where len(1) is length of the first sequence and len(2) is length of the second sequence

  • There is a fixed cost of insertions and deletions (indels) This is called "indel cost." This is uniform throughout the matrix but may be scaled, relative to the substitution cost.

  • There is a variable cost of substitutions of one element for another. This is called the "substitution cost." It can be found in one of two ways. When the sequence elements are categorical variables, it will be found in a lookup table. If the sequence elements are continuous variables, it can calculated as a Euclidean distance. Other metrics are possible but not yet implemented in this program. In many biological applications, there is only one substitution cost, a "penalty" for mismatch.
  • For i = 1 to 4,
    D(0,0,i) = 0
    For i = 1 to len(1)+1
    D(0,i,1) = 0
    D(0,i,3) = 0
    D(0,i,2) = "indel cost"
    For j = 1 to len(2)+1
    D(0,i,2) = 0
    D(0,i,3) = 0
    D(i,0,1) = "indel cost"
    For i>0 and j>0
    D(i,j,1) = D(i,j,2) = "indel cost"
    D(i,j,3) = "substitution cost"


  • The algorithm proceeds recursively, by columns or rows (it doesn't matter) That is one calculates (by rows) D(0,1,4), then D(0,2,4) to D(0,len(1)+1,4) then D(1,1,4), D(1,2,4) etc. until one reaches D(len(2)+1, len(1)+1, 4) which is the final result
  • At each stage, we calculate D(i,j,4) by the following minimization:
    D(i-1,j,4) + D(i,j,1)
    (total cost to get to the cell above the present cell plus cost of entering the present cell by inserting in the sequence spanning the rows [i.e. "deleting" in the column sequence].)

    D(i,j,4) = MIN D(i,j-1,4) + D(i,j,2)
    (total cost to get to the cell to the left of the present cell plus cost of entering the present cell by inserting in the sequence spanning the columns [i.e., "deleting" in the row sequence.)

    D(i-1,j-1,4) + D(i,j,3)
    (total cost to get to the cell above and to the left plus cost of entering the present cell by substituting the row element for the column element)

  • Notes on other algorithms.
  • It is by changing this inner minimization or making it conditional on other things that different algorithms are built. To make the algorithm "pay less attention" if an inserted state is identical with what "preceded it," we simply make the first two of these possibilities conditional on the nature of the preceding state. This produces an algorithm where "gaps" in a sequence are priced at some affine function a + bx, where a is the initial cost, and x is the (lower) cost of simply adding on another similar state. This retains some sensitivity to duration without retaining strong sensitivity to "simply continuing what went before." The current implementation is a special case, with a = 0.

    Subsequence algorithms generally work by changing all of these costs (see the workbook.) The multiple subsequence problem is far more challenging than multiple overall sequence distances, however. Current methods are somewhat ad hoc. A bunch of possible existing software is listed above, but most biological applications are for few sequences with huge lengths. They also tend to have funky front ends, since biologists have their own IO issues.

Biological Software Back to Top Implementation

Contact Information:
Address: 1126 E 59th St. Chicago, IL 60637
Office: (773) 702-4545 Fax: (773) 702-4849
Email to: a-abbott@uchicago.edu