optimal matching back to optimal matching main page

PROGRAM OPTIMIZE

Andrew Abbott

Limits

There is a strange bug in the program that keeps it to 150 sequences at present. I can't guarantee performance over that. As for the interelement distance matrix, that should go up to 90 different elements, beyond which (and even below which) you might want to think about collapsing categories or expressing categories as distances subject to euclidean measurement (and hence to direct calculation.)

Note that these programs - like all dyadic analysis systems - are not the usual regression situation. The size of the problem to be analyzed goes up with the SQUARE of the number of cases, since the analysis proceeds by measuring distances between all cases. On the downstream side, scaling and clustering are the same kind of analysis systems, and many routines for them will vomit if you give them over a hundred cases. In any case, it is not clear what you are finding out when you do give them huge problems. In general, when you analyze large datasets (over 100 cases), you should break the data up into subproblems by using some variant of leader algorithm (See FASTCLUS in SAS. Leader algorithms are discussed in Hartigan's 1975 textbook of cluster algorithms). An equivalent procedure could be designed for optimal alignment (FASTCLUS takes only variable-based input), which I have not yet done, but you can approximate one by doing the following:

  1. run a small (n<150) dataset through OPTIMIZE,
  2. cluster it with any clustering algorithm
  3. find a subset of 20 or so sequences in the data just analyzed that effectively "cover the clusters" (so there is one sequence in each important cluster throughout the clustering. Technically you want centroids, but a central sequence will do. The idea is to have a "seed" sequence in every region of the sequence space.)
  4. insert these into OPTIMIZE with the next group of data
  5. classify each new sequence as being in the "cluster belonging to the nearest of these 'seeds'" This means writing a program to scan the output matrix for this new run seeking the nearest seed to each of the NEW sequences in this dataset.

after you've repeated 4 and 5 for all remaining 150 sequence groups of data, COMBINE all seqs belonging to a single seed into small size datasets for further analysis. (This is like doing a leader algorithm by hand, but it will suffice.)

If you were doing FASTCLUS there would be a "Kmeans" type pass here, swapping sequences from cluster to cluster trying to minimize the with cluster variance.

In short, for the time being, these are techniques for small datasets in sociological terms. Don't expect the data to tell you what to do. Theorize.

Implementation Back to Top Final Matters

Contact Information:
Address: 1126 E 59th St. Chicago, IL 60637
Office: (773) 702-4545 Fax: (773) 702-4849
Email to: a-abbott@uchicago.edu