optimal matching back to optimal matching main page

PROGRAM OPTIMIZE

Andrew Abbott

Implementation

There are a few options on command line. The major one is to give the command "Optimize print > [filename]" which will print most of the run information into a file (which will be large. Don't run the debug option on command line unless you want the world's largest file on your hard disk.)

When you run OPTIMIZE the screen will ask you a number of questions, all of which relate to this calculation and the numbers necessary to do it.

1. It asks whether you want the EXPLORE module or the OPTMAT module.

EXPLORE lets you actually see what the algorithm is doing with any pair of sequences. It also lets you vary the scale parameter that relates indel costs to substitution costs, so that you can see what value works best to give matchings of the data that you like. The two modules share an input routine, though, so the next few questions are in common.

2. It asks whether the costs should be calculated or looked up.

At present the calculation routines are set up for square data arrays only - that is, time series data where all the time series compared are of the same length.

3. It asks for an initial scale factor.

This can be changed in EXPLORE, but not in OPTMAT. Typically, one would use EXPLORE to decide on a proper scale factor, then run OPTMAT to produce the final matrix of intersequence distances.

4. It asks the names of the data files.

(For hints on what these should look like, if this section isn't clear, look at the ones on the distribution disk.)

a. If you answered lookup at question 2, the machine will ask now for the name of the file containing the lookup table. It should be a file in the form of a lower triangular matrix, each element of which is the interelement distance of the row and column elements. (That is, the program reas it as if it were a lower triangular matrix; in fact it can be in the form of a purely sequential file as long as the rows and columns are pulsed correctly.) The program will then ask what is the name of the actual sequence data file. It expects input where each sequence is a list of integers separated by spaces, and each sequence is terminated by the code "99", as in "1 4 34 25 3 3 3 5 99." The integers, as you might guess, are simply indices that indicate the various categories that determine the rows/cols of the inter-element distance matrix. They have no quantitative significance.

b. If you have answered "calculate" at question 2, the machine asks immediately for the name of the sequence data file. The program expects an array pulsing VARIABLES within TIME PERIODS within CASES. You may thus analyze multivariate time series with the OPTIMIZE program, but be careful of the order in which you input data. Remember that at present, the program expects square data. It reads the data by using information it now asks for about how long the sequence is, how many variables there are. It calculates how many cases there are from the length of your input stream. If that calculation doesn't get it to the proper end of your data file, it will bounce you. But note that it has NO IDEA whether you put the data in in the pulsing order above. Gibberish is easy and cheap. Be careful.

If you are running OPTIMIZE under the calculate option, the program will now ask for an overall foundation value (a value you will have scaled with your scale factor) for the indel cost. The reason for this is that normally this basis value is calculated by taking the difference between the two largest interelement distances, adding that distance to the larger of them, and calling the result the initial figure for an indel cost. In a lookup table, the machine figures this out when it reads the table, and tells it to you as an option at the beginning of optmat. But when you are calculating substitutions as you go along, the program would have to pass through the entire dataset twice to find a general value of this kind - once to find the largest distance between ANY pair of points in ANY pair of sequences, then once to use that as the foundation cost. That seems too costly, so you have to set a value. This is one thing that EXPLORE is for - to test out values for indel scaling and for indels.

5. Running the OPTMAT module

If you are running the OPTMAT module, the program will now calculate the distances between sequences, print the distance matrix on screen, and ask you if you want to save it to a file. A dot display shows you exactly where you are in the problem, so that you don't feel lonely while the machine calculates n * (n-1) / 2 intersequence distances. Since you are probably going to do further analysis on the distances - like clustering or scaling - you will probably want to save. At present, if you want a record of what your inputs were, which were the longest sequences, etc., you need to have given the print redirection on the command line. (We'll fix this to a prompt.) Then you have to find the information in the middle of that long printout file. Once you have saved, the program will ask you if you want to do another pass. You are inside the program loop, however. If you want to do a new dataset, you have to exit and return. If, on the other hand, you want to run the same data with a different scale factor, answering OPTMAT or EXPLORE at this point will take you back into the program with the data already resident.

6. Running the EXPLORE module

If you chose explore, then once you have answered the questions at #4 above, the program will again ask for an initial scale factor. This will replace whatever value you already declared, although of course you can enter the same one again. (It asks you for it because you may be looping here from a previous run.) Tell it a scale factor.

The machine will then ask you what two sequences to display. What EXPLORE does is show you how the algorithm has solved the relation between a particular pair of sequences.

A. The Substitution Cost display:

The EXPLORE screen has three parts. On the left is an array with as many columns as there are elements in S1, the first sequence you just declared, and as many rows as there are elements in s2, the second of those two sequences. Actually, there are S1+1 columns and S1+1 rows; the ones at the top and the side merely establish the "null" beginnings of the sequences. At each point in the matrix, the program places a circle. The bigger the circle, the bigger the substitution cost between that row element and that column element. (Eventually, the circles would overlap, so over a certain cost, they are the same size.) If you have less than 99 different types of elements (probable), the rows and columns should be labeled with the actual data for these sequences (if you are doing lookups; in calculate you get simply a listing of where you are in the sequence.)

The actual solution route taken by the algorithm is displayed on the diagram as a yellow line. It will be generated from the lower right to the upper left, since it is a backward solution. The yellow line(s) tell you the actual minimum path of substitutions, insertions, and deletions that match up the two sequences; in technical terms, this is called the alignment of the two sequences. Note that there may be several best alignments. More will appear as you allow for more indelling with lower scale factors.

On this diagram, a vertical one-unit yellow line means that the row element was aligned with a null element in the column sequence (this could be considered an insertion in the sequence defining the rows or a deletion in the sequence defining the columns.) A horizontal one-unit line means that the column element was aligned with a null element in the row sequence (a deletion in the row sequence or an insertion in the column sequence). A diagonal line means the substitution of row element for column element, which would be written in an alignment as simply one letter over the other. Note that the algorithm is more or less trying to jump diagonally ONTO small circles or blanks (perfect matches; jumping onto means moving down and right). This shows that it has grabbed a cheap substitution, for a small circle signifies a low substitution cost to enter that location. If you run EXPLORE with the enclosed "optest" data [seqfile = optest.seq, Interelement distance file = Optest.ied], you will see how the algorithm jumps onto blanks (which are perfect matches).

B. The Diagnostic Section (Red):

On the upper right of the EXPLORE screen are a variety of diagnostic tools to help you figure out how to shift the indel cost and possibly get a more effective alignment.

  1. The two sequences you are looking at - S4 and S6 mean the fourth and the sixth sequences in the data. The length of each sequence is also displayed.
  2. The current scale factor and the indel cost (which is the base indel cost multiplied by this scale factor).
  3. the number of substitutions used. Note that in a square array - two sequences of equal length - this could potentially equal the sequence length, if you go straight up the diagonal. In a rectangular array, the maximum value is the length of the shorter sequence.
  4. length (of the alignment) = number of substitutions + the number of indels. Note that in a square array the number of indels will always be twice the difference between the sequence length and the number of subsitutions; every non-diagonal move takes two indels to make a move "on the square." In a rectangular array, length = number of substitutions + twice the differerence between the shorter sequence length and the number of subsitutions + the difference between the two sequence lengths. The minimum length of a rectangular alignment is the length of the longer sequence. The length of the alignment as related to its length is diagnostic of how many indels you are using and therefore of how good the alignment is. In a rectangular array, you are aiming to use some indels, but not to allow indels to do most of the work. The presumption is that the algorithm is finding local similarities and that sequences that are alike have some shifting of things here and there but not, for example, shifts where the front of one sequence becomes the back of the other and vice versa. (That's a job for subsequence alignment procedures.) An alignment that is mostly indels (lots of squares on the yellow line) is not a good alignment. It may happen that parts of the alignment are blocky and other parts single lines. In this case you have parsed your sequence pair into areas of better and worse alignment.
  5. Some cost figures
  6. Substitution cost.
    The sum of the actual substitutions costs where substitution was used.
    Total cost.
    The sum of all the costs in the alignment. This is the figure that is output to the distance matrix at the end of OPTMAT (after standardizations).
    Diag cost
    The cost of going straight up the diagonal. This is not meaningful (and not displayed) for the rectangular case.
    Len/cost
    The length divided by the total cost. This should actually be inverted, for it is then a measure of the average cost of each step in the alignment.
    Diag/cost
    The ratio of the diagonal cost to the actual cost. Printed only for square arrays. This tells you how much you have gained, in a sense, by leaving the high road of the main diagonal. Note that this will not print with a rectangular array, as it then has no meaning.

C. The Control Legend (White):

At the lower right are the directions for how to change parameters and refresh the screen.

Page UP / Page Down
change the scale factor by .01 in the indicated direction.
ALT + Page UP, ALT + Page Down
change the scale factor by 0.001 in the indicated direction. If you have two sets of Page Up and Page Down keys, you may need to try both.
CTRL + Page UP, CTRL + Page Down
change the scale factor by 0.1 in the indicated direction.
The left / right arrow keys
change the sequence on the columns. Right moves you to the next higher sequence.
The up / down arrow keys
change the sequence on the rows. Down moves you to the next higher sequence.
Escape
removes you from the module
Enter
refreshes the screen with the new scale parameters and sequences.

D. Using EXPLORE:

The main point of using explore is to see whether the algorithm is picking up the close matches that you want it to, without using zillions of indels. Remember that any pair of equal length sequences can be aligned trivially with an alignment of twice as long as either sequence. The idea is to pick up the local matches without paying to high a cost in terms of indels. It is wise to keep a record of what scale factor worked best with what sequences and above all to write down what actual indel cost worked best with each pair. This may vary from pair to pair. You will ultimately have to choose a best possibility. It is wise to check out this best possibility on a variety of sequences. Since you can't directly input indel cost in explore, you will have to play with the scale factor to get indel cost to equal what you want here, as you go through checking how your proposed "best" will work with various pairs.

Remember that with calculation of substitution costs, EXPLORE is recalculating the BASIS for indel costs for every pair. Therefore it is the ACTUAL indel cost that is of interest to you, not the scale factor. When you choose an indel cost for a final run, if you have been working with calculation, you should exit the program and reenter. Otherwise the basis on which your scale factor is multiplied will be whatever happened to be the calculated indel basis for the last pair you analyzed in EXPLORE. You want to enter a proper figure, and will be asked for it by OPTMAT if you run again from the outside. So do that.

Introduction to Optimize Algorithm Back to Top Limits

Contact Information:
Address: 1126 E 59th St. Chicago, IL 60637
Office: (773) 702-4545 Fax: (773) 702-4849
Email to: a-abbott@uchicago.edu