
SEQUENCE ANALYSIS

Annual Review of Sociology, 21: 93-113, 1995.
Andrew Abbott
Department of Sociology, University of Chicago

Methods

Researchers with sequence datasets and sequence-related questions can follow a fairly clear decision tree in their search for methods. The first decision is whether to analyze actual data or to construct data based on parameters from actual data, that is, to simulate. The latter is the choice of game theory and its relatives, and I do not treat it extensively, as Oliver (1993) has reviewed a related topic recently and Macy discusses such issues elsewhere in this volume. The central problem with simulation is the semantic one of justifying the application of a particular simulation to a particular body of data. Unfortunately, it is usually true that many simulations can be constructed to "fit" a body of data, given sufficient assumptions about parameters. The justification of any particular model is therefore often very weak. In general, simulation buys syntactic power - internal logical consistency and, often, a wonderfully counterintuitive unpredictability - at the price of this semantic obscurity. It is worth noting, however, that mathematical models of this type have a considerable history in sociology (e.g., Granovetter 1978), following on Rashevsky's curious work in the 1960s (Rashevsky 1968).

If one is not using simulation-based methods, the next decision, as I have made clear elsewhere (1992), is whether one wishes to treat the sequence as a whole or step by step.

Step by Step Methods

The step by step choice leads to time series, Markovian, and event history methods. If the central interest is a fairly deep and complex dependence of an interval-measured sequence upon its own past, then one applies time series methods. These aim to find a simple stochastic generator that effectively fits an entire sequence. Such a generator may involve autoregression, moving averages, or both in combination, and may reach varying depths into the past. The basic idea of time series analysis is to write a model which is presumably causal - mechanisms must be postulated that dictate the particular model structure - lag, averaging, and so on. New developments include introduction of "event" type disturbances in time series, on which see Isaac et al. 1991. The definitive text is Box et al. 1994.
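As a minimal sketch of the simplest such generator, the following fits a first-order autoregression x[t] = c + phi * x[t-1] + e[t] by ordinary least squares. The function name fit_ar1 and the toy series are illustrative assumptions, not part of any package discussed here; serious work would follow the identification and estimation procedures in the time series texts.

```python
def fit_ar1(series):
    """Least-squares estimates of c and phi in x[t] = c + phi * x[t-1] + e[t]."""
    x_prev = series[:-1]   # values at t-1
    x_next = series[1:]    # values at t
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    # Slope of the regression of x[t] on x[t-1]
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


# Hypothetical noise-free series generated by x[t] = 1 + 0.5 * x[t-1]
series = [0.0]
for _ in range(10):
    series.append(1.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
print(round(c, 6), round(phi, 6))
```

Because the toy series has no disturbance term, the fit recovers the generating parameters; with real data the residuals would have to be checked for remaining dependence.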

When the variable of interest is categorical, the step by step analyst employs traditional Markov methods, aiming to fit sequences of categories by estimating transition probabilities step by step and perhaps invoking either deeper past dependence (higher order) or changing probabilities (non-stationarity) to account for misfit of a one-step model. Despite early widespread application in mobility studies, Markovian analysis has not flourished in sociology, because it delivers powerful predictions only in the case of stationary processes (which are rare) and is practical only when the models involve just one previous time period and a fairly small state space. A useful review of this early but lapsed literature is Boudon (1973), although Markov models are still occasionally used (see Brent & Sykes 1979 and Manderscheid et al 1982).
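The one-step estimation described above amounts to counting adjacent pairs of categories and normalizing each row. The sketch below assumes a stationary first-order process; the function name and the three toy careers (U = unemployed, M = manual, N = nonmanual) are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order transition probabilities from adjacent pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        # Each row is normalized to sum to one
        probs[prev] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Hypothetical toy careers, one character per period
careers = ["UMMN", "UMNN", "MMNN"]
P = transition_probabilities(careers)
for state in sorted(P):
    print(state, P[state])
```

Pooling all periods into one table is exactly the stationarity assumption; estimating a separate table per period, or conditioning on two prior states, would be the non-stationary and higher-order variants.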

Where one is interested in transitions from only one particular prior category, and the issue is time till transition (e.g., how long is it before married people get divorced), one has event history methods, known outside sociology as duration methods, hazard methods, failure analysis, and various other things. These have begun to see very widespread application in sociology, as any perusal of ASR or AJS shows. Although I personally have reservations about these methods (Abbott 1983, 1992b), they are widely used and must be understood by any serious sequence analyst. A current review is Yamaguchi (1991).
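To make the time-till-transition idea concrete, here is a rough life-table sketch of a discrete-time hazard: at each observed event time, the number of transitions divided by the number of cases still at risk. The function name, the data, and the convention that a case censored at time t still counts as at risk at t are all assumptions made for illustration.

```python
def discrete_hazard(durations, observed):
    """Return {t: events at t / cases at risk at t} for each event time t.

    durations: time until the event or until observation ended
    observed:  True if the event occurred, False if the case was censored
    """
    hazards = {}
    event_times = sorted({d for d, e in zip(durations, observed) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if e and d == t)
        hazards[t] = events / at_risk
    return hazards


# Hypothetical data: years married until divorce (True) or end of study (False)
durations = [2, 3, 3, 5, 5, 5]
observed = [True, True, False, True, False, False]
h = discrete_hazard(durations, observed)
print(h)
```

Event history models proper go further by making this hazard a function of covariates.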

Whole Sequence Methods

If on the other hand one wishes to treat sequences as whole units, then one must have recourse to special methodologies. However, it is useful, first, to note the basic questions of interest here. The central issue in whole sequence analysis is nearly always whether there are patterns in the sequences, either over the whole sequences or within parts of them. One may wish to ask where these patterns come from (making them dependent variables) or what they mean for the future (making them independent variables) but the first problem is always to figure out whether the patterns are there.

There are two broad types of approaches to this question of pattern. One is algebraic, the other metric. In the algebraic approach, the aim is to reduce each sequence to some simplest form and to simply gather all sequences with similar "simplest forms" under one heading. In the metric approach, the analyst develops a measure of resemblance that gives the "distance" between any pair of sequences. These distances are then subjected to some standard classification method like scaling or clustering (on which see Arabie & Hubert 1992 for a recent review).
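The second step of the metric approach can be sketched in a few lines. Given a matrix of pairwise distances between sequences, the code below groups together any sequences linked by a chain of distances at or below a cutoff, which is what single-linkage clustering yields when its tree is cut at that height. The function name, the cutoff, and the distance matrix are hypothetical; any standard clustering or scaling routine could be substituted.

```python
def threshold_clusters(dist, cutoff):
    """Single-linkage groups: items joined by chains of distances <= cutoff."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group's representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical symmetric distances among four sequences
dist = [
    [0, 1, 5, 6],
    [1, 0, 4, 6],
    [5, 4, 0, 2],
    [6, 6, 2, 0],
]
clusters = threshold_clusters(dist, cutoff=2)
print(clusters)
```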

Many people are seeking ways to code whole sequence data. Often, the coding seems obvious, as in coding careers as simple lists of jobs; a career is simply the year-by-year or month-by-month list of jobs held. But when matters are more complex - as in attempts to code interaction or stories or in any case of parallel sequence tracks - varieties of schemes are endless, and could be the subject of their own review. There is a large literature in discourse analysis and related literatures on detailed formal representations of discourse and discourse-like processes. Poole (e.g., 1989) has done much work developing multi-level coding schemes for interaction, and indeed the Balesian Interaction Process Analysis tradition continues (Kosaka 1993, responding to Abell 1993). See also the extensive work of Bakeman & Gottman (1986). Trabasso (Trabasso & Nickels 1992) has developed a "causal network discourse analysis." Another new scheme has emerged in the sociology of science (Latour et al 1992, Scott 1992, Carlson and Gorman 1992), although it seems to this writer that the Latour et al project more or less reinvents network analysis. For yet another coding system (and software), see Carley (1991) and Carley & Palmquist (1992). Mishler (e.g., 1986, 1990) has done much work on interview coding and analysis. Franzosi (1990) and others have worked on the coding of stories from newspapers. Indeed, much of David Heise's ETHNO program is aimed to help investigators create a formally coded narrative structure for sequence data in interactive format.

What to do with the data once coded is another matter. In the one specific case of nonrecurrent sequences with complete data (each event observed once and only once in each sequence) there is a broad variety of permutational techniques available. These follow the metric approach above. They produce measures of resemblance between sequences, which the analyst may then scale or cluster in order to create categories which can then in turn serve as dependent or independent variables. Spearman correlation is such a measure of resemblance. Where one has nonrecurrent sequences with missing data (through time frame censoring or simple non-occurrence), the problem can be treated as a seriation or one-dimensional scaling problem, provided there are measures of distance or resemblance between sequence elements (for an example, see Abbott 1991). Arabie & Hubert (1992), however, warn of difficulties in such applications.
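For the complete-data nonrecurrent case, Spearman correlation can be computed directly from the positions at which each event occurs in the two sequences, using the standard formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The function name and the event labels below are hypothetical.

```python
def spearman_order_similarity(seq_a, seq_b):
    """Spearman correlation between two orderings of the same distinct events."""
    n = len(seq_a)
    if n < 2 or len(set(seq_a)) != n or sorted(seq_a) != sorted(seq_b):
        raise ValueError("sequences must order the same set of distinct events")
    pos_b = {event: i for i, event in enumerate(seq_b)}
    # Squared difference in position of each event across the two sequences
    d2 = sum((i - pos_b[event]) ** 2 for i, event in enumerate(seq_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical order of events in two cases
case_1 = ["school", "license", "association", "journal"]
case_2 = ["school", "license", "journal", "association"]
rho = spearman_order_similarity(case_1, case_2)
print(rho)
```

A value of 1 means identical order and -1 means exactly reversed order; the matrix of such values over all pairs of cases is what would then be scaled or clustered.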

Once one moves to recurrent events, there are three explicit methodologies for considering sequence regularities of some complexity, all products of extensive theoretical and research programs. Two are algebraic in approach (David Heise and Peter Abell) and one is metric (Andrew Abbott). If the data involve branching and merging sequence lines, that is, if they involve separable but simultaneous events that the analyst does not wish to treat as "combination events," the only possible choices at present are the methods of David Heise and Peter Abell. In the case of unilinear sequences, Abbott's methods are available and probably preferable. In reality, the three methods are aimed at different parts of the overall task of sequence analysis. Heise's methods are chiefly aimed at a rigorous and fully justified coding. (See his comments on Abell 1993.) Abell's methods are chiefly aimed at algebraic classification of already coded narratives. Abbott's methods are aimed at metric classification of coded narratives, and while much more flexible than Abell's, are presently available only for the case of unilinear sequences.

David Heise's ETHNO system arose out of his work coding extended action and interaction patterns and looks to the theories of Fararo and Skvoretz (1984). It is embodied in an interactive software system (ETHNO: see Heise 1989, 1991) which is available from W. C. Brown publishers in Dubuque IA for $50. For reviews, see Griffin (1993) and Barnes (1993). ETHNO presumes that the analyst has a dataset composed of more or less extended narratives of particular stories that must be reduced to bare bones structures by creating a lexicon of actions and events and then formalizing the ways in which these actions and events proceed from one another. The formalization encodes the analyst's understanding of necessary relationships among the events to be placed in the narrative structure. ETHNO is well illustrated in Griffin's (1991) piece on lynching and Eder and Enke's (1991) equally challenging analysis of gossip. See also Heise's own applications, especially Corsaro & Heise (1990).

Far more than Abell and Abbott, Heise has focused on problems of semantic definition in sequence analysis. The end result is an exceptionally clear and empirically justified coding of the narrative, combined with the idea that generalization entails finding other narratives that "collapse" into the same primitive structure when the various narrative lines are simplified. (This is more or less the narrative homomorphism criterion of Abell, which Heise in commenting on Abell 1993 argues is the best criterion for sequence resemblance.)

Abell's homomorphism approach to sequence resemblance grew out of his own research on cooperatives and his extensive critique of standard methodologies (Abell 1987). The best recent exposition is Abell 1993, which has the advantage of being followed by lengthy responses from Heise, Fararo, Skvoretz, Abbott and other analysts of social sequence data. Like Heise's methods, Abell's are applicable to complex, network narratives. They should ultimately be available (in late 1995) in the UCINET system of programs, but at present are available from Martin Everett in the Department of Mathematics, Greenwich (England) University. Abell's methods presume, like Heise's, a dataset comprising various narratives of various degrees of complexity, but they focus more on the formal graphical structure of these narratives. They contain routines for reducing any coded narrative to a minimal homomorphic representation, which effectively allows one to further classify a set of Heise codings once they are developed within ETHNO. Abell (1993) has done a detailed analysis of narrative data on consumer cooperatives.

Abbott's strategy for sequence analysis - optimal matching or optimal alignment as it is usually called - derives from a broad movement in the hard sciences, although it was stimulated by his work as a historical sociologist of occupations (Abbott 1988a) and his critique (similar to Abell's) of general linear models. The methods Abbott has adapted derive from biology and from the pattern resemblance community in computer science. The classic text on these methods is Sankoff and Kruskal (1983). Recent primers are Heijne (1987) and Gribskov and Devereux (1992). Some recent developments are discussed in Abbott (1993).

The principal practical difference between Abbott's methods and Heise's or Abell's is that Abbott's are limited to unilinear sequences and proceed by metric means. They presume a dataset made of sequential lists of events like job careers or criminal careers. Under more or less parametric assumptions about the resemblance of individual events (jobs or crimes), the methods allow one to find resemblances between either entire sequences or subsequences within them. As with other metric techniques of sequence analysis, these resemblances are then input to scaling, clustering, and other categorization methods to uncover actual categories of patterns. Abbott's techniques are available in simple form in his OPTIMIZE program (from him, $20), although any publicly available biological sequence alignment program will do the same things, after some modification of its IO facilities.
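The core computation can be illustrated with the textbook dynamic program for alignment distance: the least total cost of insertions, deletions, and substitutions needed to turn one sequence into another. This is a sketch under assumed costs, not the OPTIMIZE program itself; the function names, the unit costs, and the toy careers are hypothetical, and in practice the substitution costs would encode the analyst's assumptions about how closely individual events resemble one another.

```python
def optimal_matching_distance(seq_a, seq_b, sub_cost, indel_cost=1.0):
    """Minimum total cost of edits transforming seq_a into seq_b."""
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] = cost of aligning the first i events of a with the first j of b
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,          # delete from seq_a
                dist[i][j - 1] + indel_cost,          # insert into seq_a
                dist[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
            )
    return dist[n][m]


def unit_sub_cost(x, y):
    """Assumed substitution cost: 0 for identical events, 1 otherwise."""
    return 0.0 if x == y else 1.0


# Hypothetical job careers, one letter per year
careers = ["AABBC", "ABBCC", "CCCCC"]
matrix = [
    [optimal_matching_distance(a, b, unit_sub_cost) for b in careers]
    for a in careers
]
for row in matrix:
    print(row)
```

The resulting matrix of pairwise distances is the input to the scaling or clustering step.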

Optimal matching has begun to see substantial use in social science. Abbott has published applications (Abbott and Forrest 1986), including one on careers (Abbott and Deviney 1992). He has tested reliability in Forrest and Abbott (1990). Jones and Brain (1985) utilize the event distance coding characteristics of optimal matching. Coxon (1988) reports both an elaborate coding scheme and application of matching methods to sexual behavior. (There may be more complete publications from this study that I have not been able to find.) Miller & Roid (1993) apply optimal alignment to movement patterns in infants, while Levitt and Nass (1989) nicely use optimal matching to demonstrate consistency among physics and sociology textbooks. Saberwahl and Robey (1993) apply the techniques to find regularities in innovation processes, as does Poole (????).

One of the advantages of optimal matching/alignment as a sequence technique is that analysts whose theories predict different forms of resemblance can vary the algorithms to suit them. Thus, stage theories with some local disorder - a common position among stage theorists - can be sought using variants of the cellar algorithm (Wagner 1983). Theories implying the unimportance of minor shifts in duration of runs can be handled with variants of the affine gap-cost algorithm (Sankoff & Kruskal 1983:296ff). Subsequence problems across large datasets (the "turning point" question in life cycle data) can be handled with new local alignment algorithms using Gibbs samplers and expectation maximization (Lawrence et al 1993). More important, the unilinearity constraint is slowly being lifted. Even game theoretic sequences (which branch at each decision point) will probably be analyzable using the tree resemblance algorithms of Shasha and others (Zhang and Shasha 1989, see also Sankoff and Kruskal 1983: 265ff).

At present the measurement theory of whole sequence data is preliminary indeed. In the highly specialized but fairly common situation of sequences embodying stages that all cases pass through in the same order, the methods of Collins et al (1988: a kind of generalized Guttman methodology) may prove to be generalizable from their original use as a generator of scaling items. But much depends on formal models of sequence generation and, probably more importantly, on the actual methodology chosen. Abbott (1984) raises a number of general measurement issues. Another extraordinarily interesting area is scheduling theory, for example, Birch's (1984) paper on sequences generated by simple activation time models. Scheduling models, like the Fararo/Skvoretz production models, may prove crucial to developing serious measurement analysis for sequence data. Also important may be artificially intelligent sequence-guessing programs. For an example, see Dietterich & Michalski 1985.

In summary, sequence methods seem to be poised at a point of major development. Once the vast array of biological algorithms can be harnessed to a reasonable sociological IO system, Abbott's empiricist sequence resemblance techniques should become a reasonable tool. The major problem then will be to think through ways of combining Heise's elegance of description, Abell's rigor of homomorphic classification, and Abbott's empirically practical use of metric techniques. Unsolved even in computer science, however, is the combination of the last two (analysis of network-structured sequence data with metric resemblance techniques), although see Abbott (1993) for speculations on this matter. Unsolved, too, are the great difficulties of developing a serious measurement and statistical theory for sequence analysis. All these methods at present are largely heuristic. Only now are the biologists really facing the problems of statistical inference with sequences, problems which differ in many ways from the kinds of inferences sociologists may wish to draw. (For the biological literature, see Karlin et al. 1991 and Vingron and Waterman 1994.)

Sequence analysis holds great promise for sociology. Most of our classic theories are sequential or interactional theories. The new methods address those theories directly. What the field needs now are junior scholars committed to exploring what sequence regularities can tell us about social life. The methods will not, to be sure, solve the great problems represented by interactional fields - where all sequences are mutually dependent and data are bewilderingly complex. But they will provide us with far more effective ways of analyzing life courses, careers, and other such relatively independent and regular sequences, ways of analyzing that accord directly with the canonical theoretical apparatus of the discipline.


Contact Information:
Address: 1126 E 59th St. Chicago, IL 60637
Office: (773) 702-4545 Fax: (773) 702-4849
Email to: a-abbott@uchicago.edu