![]() |
|
PROGRAM OPTIMIZELimitsThere is a strange bug in the program that keeps it to 150 sequences at present. I can't guarantee performance over that. As for the interelement distance matrix, that should go up to 90 different elements, beyond which (and even below which) you might want to think about collapsing categories or expressing categories as distances subject to euclidean measurement (and hence to direct calculation.) Note that these programs - like all dyadic analysis systems - are not the usual regression situation. The size of the problem to be analyzed goes up with the SQUARE of the number of cases, since the analysis proceeds by measuring distances between all cases. On the downstream side, scaling and clustering are the same kind of analysis systems, and many routines for them will vomit if you give them over a hundred cases. In any case, it is not clear what you are finding out when you do give them huge problems. In general, when you analyze large datasets (over 100 cases), you should break the data up into subproblems by using some variant of leader algorithm (See FASTCLUS in SAS. Leader algorithms are discussed in Hartigan's 1975 textbook of cluster algorithms). An equivalent procedure could be designed for optimal alignment (FASTCLUS takes only variable-based input), which I have not yet done, but you can approximate one by doing the following:
after you've repeated 4 and 5 for all remaining 150 sequence groups of data, COMBINE all seqs belonging to a single seed into small size datasets for further analysis. (This is like doing a leader algorithm by hand, but it will suffice.) If you were doing FASTCLUS there would be a "Kmeans" type pass here, swapping sequences from cluster to cluster trying to minimize the with cluster variance. In short, for the time being, these are techniques for small datasets in sociological terms. Don't expect the data to tell you what to do. Theorize. |
|
|
| |
|||
|
Contact Information: |
|||