optimal matching back to optimal matching main page

PROGRAM OPTIMIZE

Andrew Abbott

Preliminary Notes

Optimal matching or optimal alignment is a family of dynamic programming techniques designed to produce distances between data items that are whole sequences of events. The general approach is to measure those distances in terms of various elementary operations that turn one sequence into another. The most standard elementary operations are insertion or deletion of elements and substitution of elements. These elementary operations obey certain cost constraints, which are then minimized recursively through the sequence to produce a final total cost. The general name for these kind of metrics is Levenshtein metrics.

The basic text for all these techniques is Sankoff and Kruskal (1983). A recent major collection is Doolittle (1990). See also Miura (1986) and Gribskov and Devereux (1992). (If you can wait a year, I am finishing a review essay on sequence analysis algorithms for the Journal of Mathematical Sociology.) To master the basic ideas, you need to read the introductory chapter of Sankoff and Kruskal (1983) and at least two or three typical analyses. My suggestions for the latter are some of my applications (Abbott and Forrest 1986, Abbott and Hrycak 1990, Abbott and Deviney 1992) and the original Beldings sparrows paper, Chapter 6 in Sankoff and Kruskal. The main thing to realize is that these techniques are not very unusual or difficult mathematically. Most of the "difficulty" lies 1) in establishing reasonable substitution costs for a particular application, 2) in getting used to the facts that the methods do NOT think about time as directed towards the future (that's why they work as well on linear space as on time) and that they make NO stochastic assumptions at all, and 3) in dealing with the unfamiliar (to sociologists) problem of what to do with the distances that you get at the end, which require methods like scaling and clustering that we don't use much.

All of the applications listed here work with only indels (= insertions and deletions) and substitutions. Some applications (e.g., working with time series) also involve the operation of compression or expansion, changing the rate of time in one of the sequences relative to the other. This is discussed in Chapter 4 of Sankoff and Kruskal. I am implementing it now in some continuous variable algorithms.

Bear in mind that the algorithm discussed below and implemented on the diskette is only the simplest of the dozens of algorithms available for optimal alignment, each one tailored to a particular set of assumptions about the generation of the sequences involved. I am investigating whether it is preferable to implement more of these algorithms or to simply refer people to the biological software, which is now more user friendly than it was a few years ago when I began in this area.

Here are some notes on the present cutting edge in alignment methods. Those interested in array processing have shown that generalized Levenshtein metrics can be defined on them (Amir and Landau 1991). There is also a substantial literature on resemblance of trees, which stand half way between my unilinear sequences and Abell's branching and reconnecting (but acyclic) digraphs. See Zhang and Shasha (1989) for an algorithm and further citations. In biology, the problem of structures more complex than linear sequences has been addressed largely in terms of "secondary structure," the bends and folds by which an RNA molecule brings different parts of itself into connection. The methods developed for secondary structure are not directly transferrable to Abell's network resemblance problem. See Sankoff 1985, Sali and Blundell 1990, and various essays in Doolittle 1990, section IV. There is a large literature on merging a number of versions of a sequence into one "consensus" version. This is the "multiple alignment" problem in biology. Representative papers are Altschul 1989, Altschul and Lipman 1989, Altschul, Carroll, and Lipman 1989, Hein 1989, and Lipman, Altschul, and Kececioglu 1989. A current general review of the whole area (in biological applications) is Weir 1988. See also the many essays in Doolittle 1990. Note that alternatives to Levenshtein metrics are sometimes considered, but always within the dynamic programming framework (see Ehrenfeucht and Haussler 1988). Also, if you do venture into the biological literature, be advised the terminology in sequence analysis is NOT standardized, and you may have to figure out that my "indels" are "gaps" in the biology literature, my "costs" their "penalties," and so on. I use the language from Sankoff and Kruskal (1983).

There have been some serious advances in algorithms in the last couple of years. First, the multiple alignment problem for subsequences as seen several new approaches; the most important is Lawrence et al., 1993. Second, and more important from the sociologists' point of view, the computational community has begun addressing the tradeoff of substitution and indel costs via the concept of tesselations. For a review see Vingron and Waterman, 1994. Multiple substitution costs, however, remain a black box, since the tesselated solution space involved has one more dimension than the number of distinguishable substitution costs.

I have written a review essay on sequence analysis for the Annual Review of Sociology. By this time, these algorithms will undoubtedly have been updated.


Introduction Back to Top Biological Software

Contact Information:
Address: 1126 E 59th St. Chicago, IL 60637
Office: (773) 702-4545 Fax: (773) 702-4849
Email to: a-abbott@uchicago.edu