![]() |
|
PROGRAM OPTIMIZEImplementationThere are a few options on command line. The major one is to give the command "Optimize print > [filename]" which will print most of the run information into a file (which will be large. Don't run the debug option on command line unless you want the world's largest file on your hard disk.) When you run OPTIMIZE the screen will ask you a number of questions, all of which relate to this calculation and the numbers necessary to do it. 1. It asks whether you want the EXPLORE module or the OPTMAT module.EXPLORE lets you actually see what the algorithm is doing with any pair of sequences. It also lets you vary the scale parameter that relates indel costs to substitution costs, so that you can see what value works best to give matchings of the data that you like. The two modules share an input routine, though, so the next few questions are in common. 2. It asks whether the costs should be calculated or looked up.At present the calculation routines are set up for square data arrays only - that is, time series data where all the time series compared are of the same length. 3. It asks for an initial scale factor.This can be changed in EXPLORE, but not in OPTMAT. Typically, one would use EXPLORE to decide on a proper scale factor, then run OPTMAT to produce the final matrix of intersequence distances. 4. It asks the names of the data files.(For hints on what these should look like, if this section isn't clear, look at the ones on the distribution disk.) a. If you answered lookup at question 2, the machine will ask now for the name of the file containing the lookup table. It should be a file in the form of a lower triangular matrix, each element of which is the interelement distance of the row and column elements. (That is, the program reas it as if it were a lower triangular matrix; in fact it can be in the form of a purely sequential file as long as the rows and columns are pulsed correctly.) The program will then ask what is the name of the actual sequence data file. It expects input where each sequence is a list of integers separated by spaces, and each sequence is terminated by the code "99", as in "1 4 34 25 3 3 3 5 99." The integers, as you might guess, are simply indices that indicate the various categories that determine the rows/cols of the inter-element distance matrix. They have no quantitative significance. b. If you have answered "calculate" at question 2, the machine asks immediately for the name of the sequence data file. The program expects an array pulsing VARIABLES within TIME PERIODS within CASES. You may thus analyze multivariate time series with the OPTIMIZE program, but be careful of the order in which you input data. Remember that at present, the program expects square data. It reads the data by using information it now asks for about how long the sequence is, how many variables there are. It calculates how many cases there are from the length of your input stream. If that calculation doesn't get it to the proper end of your data file, it will bounce you. But note that it has NO IDEA whether you put the data in in the pulsing order above. Gibberish is easy and cheap. Be careful. If you are running OPTIMIZE under the calculate option, the program will now ask for an overall foundation value (a value you will have scaled with your scale factor) for the indel cost. The reason for this is that normally this basis value is calculated by taking the difference between the two largest interelement distances, adding that distance to the larger of them, and calling the result the initial figure for an indel cost. In a lookup table, the machine figures this out when it reads the table, and tells it to you as an option at the beginning of optmat. But when you are calculating substitutions as you go along, the program would have to pass through the entire dataset twice to find a general value of this kind - once to find the largest distance between ANY pair of points in ANY pair of sequences, then once to use that as the foundation cost. That seems too costly, so you have to set a value. This is one thing that EXPLORE is for - to test out values for indel scaling and for indels. 5. Running the OPTMAT moduleIf you are running the OPTMAT module, the program will now calculate the distances between sequences, print the distance matrix on screen, and ask you if you want to save it to a file. A dot display shows you exactly where you are in the problem, so that you don't feel lonely while the machine calculates n * (n-1) / 2 intersequence distances. Since you are probably going to do further analysis on the distances - like clustering or scaling - you will probably want to save. At present, if you want a record of what your inputs were, which were the longest sequences, etc., you need to have given the print redirection on the command line. (We'll fix this to a prompt.) Then you have to find the information in the middle of that long printout file. Once you have saved, the program will ask you if you want to do another pass. You are inside the program loop, however. If you want to do a new dataset, you have to exit and return. If, on the other hand, you want to run the same data with a different scale factor, answering OPTMAT or EXPLORE at this point will take you back into the program with the data already resident. 6. Running the EXPLORE moduleIf you chose explore, then once you have answered the questions at #4 above, the program will again ask for an initial scale factor. This will replace whatever value you already declared, although of course you can enter the same one again. (It asks you for it because you may be looping here from a previous run.) Tell it a scale factor. The machine will then ask you what two sequences to display. What EXPLORE does is show you how the algorithm has solved the relation between a particular pair of sequences. A. The Substitution Cost display:The EXPLORE screen has three parts. On the left is an array with as many columns as there are elements in S1, the first sequence you just declared, and as many rows as there are elements in s2, the second of those two sequences. Actually, there are S1+1 columns and S1+1 rows; the ones at the top and the side merely establish the "null" beginnings of the sequences. At each point in the matrix, the program places a circle. The bigger the circle, the bigger the substitution cost between that row element and that column element. (Eventually, the circles would overlap, so over a certain cost, they are the same size.) If you have less than 99 different types of elements (probable), the rows and columns should be labeled with the actual data for these sequences (if you are doing lookups; in calculate you get simply a listing of where you are in the sequence.) The actual solution route taken by the algorithm is displayed on the diagram as a yellow line. It will be generated from the lower right to the upper left, since it is a backward solution. The yellow line(s) tell you the actual minimum path of substitutions, insertions, and deletions that match up the two sequences; in technical terms, this is called the alignment of the two sequences. Note that there may be several best alignments. More will appear as you allow for more indelling with lower scale factors. On this diagram, a vertical one-unit yellow line means that the row element was aligned with a null element in the column sequence (this could be considered an insertion in the sequence defining the rows or a deletion in the sequence defining the columns.) A horizontal one-unit line means that the column element was aligned with a null element in the row sequence (a deletion in the row sequence or an insertion in the column sequence). A diagonal line means the substitution of row element for column element, which would be written in an alignment as simply one letter over the other. Note that the algorithm is more or less trying to jump diagonally ONTO small circles or blanks (perfect matches; jumping onto means moving down and right). This shows that it has grabbed a cheap substitution, for a small circle signifies a low substitution cost to enter that location. If you run EXPLORE with the enclosed "optest" data [seqfile = optest.seq, Interelement distance file = Optest.ied], you will see how the algorithm jumps onto blanks (which are perfect matches). B. The Diagnostic Section (Red):On the upper right of the EXPLORE screen are a variety of diagnostic tools to help you figure out how to shift the indel cost and possibly get a more effective alignment.
C. The Control Legend (White):At the lower right are the directions for how to change parameters and refresh the screen.
D. Using EXPLORE:The main point of using explore is to see whether the algorithm is picking up the close matches that you want it to, without using zillions of indels. Remember that any pair of equal length sequences can be aligned trivially with an alignment of twice as long as either sequence. The idea is to pick up the local matches without paying to high a cost in terms of indels. It is wise to keep a record of what scale factor worked best with what sequences and above all to write down what actual indel cost worked best with each pair. This may vary from pair to pair. You will ultimately have to choose a best possibility. It is wise to check out this best possibility on a variety of sequences. Since you can't directly input indel cost in explore, you will have to play with the scale factor to get indel cost to equal what you want here, as you go through checking how your proposed "best" will work with various pairs. Remember that with calculation of substitution costs, EXPLORE is recalculating the BASIS for indel costs for every pair. Therefore it is the ACTUAL indel cost that is of interest to you, not the scale factor. When you choose an indel cost for a final run, if you have been working with calculation, you should exit the program and reenter. Otherwise the basis on which your scale factor is multiplied will be whatever happened to be the calculated indel basis for the last pair you analyzed in EXPLORE. You want to enter a proper figure, and will be asked for it by OPTMAT if you run again from the outside. So do that. |
|
|
| |
|||
|
Contact Information: |
|||