Programs for carrying out permutation tests [ Attention: The version of snn.c available on this site before 9/18/2000 has a bug which resulted in incorrect results. Ignore all results based on earlier versions of snn.c. The published results concerning power were based on a different version which is ok. ] This directory contains software for carrying out permutation tests for detecting geographic subdivision when one has samples from two or more localities. The two sample tests are those described in Hudson, Boos and Kaplan 1992 MBE 9: 138-151, except that the chi-square tests with lumping are not done by these programs. Instead the permutation test based on a chi-square statistic without lumping is done as suggested by Roff and Bentzen 1989 MBE 6: 539-45. For more than two localities the extension of the Roff and Bentzen test is obvious. For the other tests the extension is as follows. Recall that the tests for two localities was based on Ks = wK1 + (10w)k2, where w is a weighting that depends on the sample sizes from the two localities. (This is for the test based on number of differences, the other tests will extend in an analogous fashion.) For more localities, the test is based on Ks = Sum( wi*Ki ), where Ki is the mean distance between pairs of sequences from locality i, and where wi is a weighting factor based on the sample size. wi are calculated as follows wi = (ni-wf)/(ntotal -wf*nlocs ), where wf is a parameter to be specified by the user. (Usually wf will be 0 or 2, see below.) The p-value estimated by the program is the proportion of permuted samples with Ks less than or equal to the observed value. Also included in this directory is a program, snn.c, to carry out the "nearest neighbor test" as described in Hudson (2000) Genetics 155: 2011-2014. Files to retrieve To carry out the tests described here, you will need to retrieve all the *.c files, plus the files dotests_mat and dotests_seq. The file "compilations" indicate how to compile and link the programs. There are two sample data files, sample.dat and seq.dat. Two kinds of data files can be used for these tests. The data can be in the form of matrix of pairwise differences between all sampled genes. The genes from each locality being grouped in blocks in the matrix. An example of such a data file is sample.dat . A more complete description of the data format is given below. The other possible data format is a file with the sequence of each sampled gene given. The format for data of this type is given in the source file seqtomatrix.c, which codes for program to convert the list of sequences to a matrix of differences. seq.dat is an example data file with sequence data. Compiling and Linking The programs should be compiled and linked as indicated in the file "compilation" in this directory. Your machine may require some additional flags or switches. I think the code is fairly standard and should port to most systems ok. Some machines may require different header files ( e.g. malloc.h etc ).or not have some files ( e.g. ctype.h ). flags or switches. Using the programs The program for doing the basic permutation test is permtest. A list of command line arguments for this (and most the other programs) can be obtained by invoking them without arguments. Here is the usage of permtest: permtest stat wf y/n n_permuts n_locals n1 n2 ...outifle where stat: This is just the symbol that will be used in the output to indicate the kind of statistic that is being used. The appropriate symbol depends on the distance measure that is used in the data file. Thus the output will label the observed values of the statistics with "stat"st "stat"s and "stat"t, where the quoted part will be replaced by whatever was given as first argument. If the data file contains pairwise differences between sequences, then K is an appropriate first argument. It is important to note that the first argument has no effect on the test, the observed values , or the p-values, it only determines what label is printed along side the results. wf: This is used in determining the weighting factor that is used to average the within locality measures of distance. The weighting factor is the "w" that appears in equation (10) of Hudson, Boos and Kaplan MBE 9:138-151 (1992). For the statistic Ks, they recommend setting w = n1/(n1+n2). For other statistics, they recommend w = (n1-2)/(n1+n2-4). The program uses wi, for weighting the "within distance measure" in the ith local (which has sample size ni), where wi = (ni-wf)/(ntotal -nloc*wf) . If there are two localities and wf is set to zero this gives the weighting recommended by Hudson et al for Ks. If wf is set equal to two, then the resulting weighting (for two locality case) gives the weighting recommended Summarizing, use 0 for wf when testing with Ks, otherwise use 2. y/n: A y here indicates that the input data matrix should be output. n_permuts: The number of permutations to carry out. Usually 1000 or more. n_locals: The number of localities from which the samples were taken. n1 n2 ... : The sample sizes from each of the sample localities. infile: The data file (with the matrix of differences. outfile: The name of the output file. For example: permtest K 0 n 1000 2 20 27 test.out will permform the permutation test using the data matrix in sample.dat, in which the elements of the matrix are number of differences between the sampled sequences. Since the input consists of pairwise number of diffences the resulting statistics are Ks, Kst and Kt, and hence the output will be so labeled since the first argument is K. The second argument sets wf to zero, so the weighting will be ni/ntotal for the ith local. The third argument "n" instructs the program to refrain from outputting the input data matrix. The number of permutations carried out is 1000, as specified by the fourth argument. The number of localities sampled is 2, and 20 sequences were from locality one and 27 were from locality two. The results were directed to the file test.out. The contents of file test.out should look like: newt.rhudson{415} cat test.out Sample configuration: 20 27 Number of permutations: 1000 weighting constant: 0 Observed values of statistics: Kst: 0.001133 , Ks: 7.161168 Kt: 7.169288 ( p-value: 0.363000) newt.rhudson{416} Other tests can be performed by transforming the pairwise difference matrix in various ways and then using permtest. For example logtrans test.out will first transform the data matrix into a matrix which contains elements equal to the log(1+dij), where dij is the number of differences between sequence i and sequence j. The statistics based on this measure of distance between sequences were designated K*s, etc in Hudson, et al, (see equation (11) ) , and hence it is helpful to label the output as such. Hence the first argument is given as 'K*'. (The single quotes may be necessary so that the shell doesn't do anything weird with the asterisk.) The file dotests_mat will carry out most of the tests described by Hudson et al, by transforming the data matrix in the appropriate ways. These programs will not do the chi-square tests with lumping of classes. Here is an example of using dotests_mat: dotests_mat sample.dat 1000 2 20 27 > tests.out The program permchi carries out the permutation test of Roff and Bentzen, using the chi-square statistic, without any lumping of types. This test will also be carried out by the dotests_m program. The chi-square tests with lumping to get minimum expected numbers are not carried out by these programs. The program seqtomatrix will convert data files with sequences (in the appropriate format) to a file with a matrix of pairwise differences. For example, seqtomatrix 47 41 mat.dat will use the 47 sequences (each 41 sites long) to create a pairwise difference matrix in the file mat.dat, which can be used as input to the permutation test programs. The shell program in dotests_seq will utilize a file of sequences to do the same tests as dotests_mat . Data format: The data required is a matrix of number of pairwise differences between all sampled copies of the gene. The format required is the default format generated by MEGA except the first line should read: Number of sequences nn where nn is the number of sequences. (See the sample.dat file for an example.) The matrix should be upper triangular. If n1 denotes the sample size from locality one, and n2 the sample size from locality two, the first n1 rows of the matrix are assumed to give distances between locality one genes and other genes. ( In other words the element d(i,j) of the matrix gives the number of differences between sequence i and sequence j, where sequence i is from locality one if i less than or equal to n1, and is from locality two if i is greater than n1.) The program will ignore all lines in the file until the first 4 characters on a line are OTUs . (The program is case sensitive.) On the same line after the word "OTUs" there must follow n blank-space-separated words (MEGA does this, and the words are in this case just the number labels for the OTUs ). (n is the total sample size, except if the matrix is split in which case n is the number of columns displayed.) Then the following lines must each start with a word (an OTU label) followed by the row of matrix elements ( remember just the upper triangular part of the matrix.) An example data file: Number of sequences 7 Input File Name: C:\MEGA\PLORG12.MEG on Thu Nov 03 11:12:14 1994 Input Data: DNA No. of Otus Used: 7 of 14 Distance: Number of different nucleotides. No. of Nucleotides in Subset: 1530 of 1530 Gap Sites and Missing Information Data: All such sites were removed from the subset data OTU Labels 1.. Tpa 2.. Bbo 3.. Bca 4.. Tta 5.. Cwr 6.. Nca 7.. Perk Distances in the upper-right matrix '*' indicates an invalid distance value OTUs 1 2 3 4 5 6 7 1 129.0000 66.0000 1.0000 105.0000 86.0000 115.0000 2 122.0000 129.0000 180.0000 167.0000 195.0000 3 65.0000 125.0000 107.0000 135.0000 4 106.0000 85.0000 115.0000 5 98.0000 113.0000 6 116.0000 7 All that is really needed by permtest are the last 8 lines. The split matrix format that MEGA produces when the page width is less than is needed to print the matrix will also work as input to permtest. Random number generation The last 10 lines or so of the program contain the subroutine ran1(). This subroutine generates a pseudorandom uniform on the interval 0,1 by calling the function rand(), which is often in the standard library of C subroutines provided with a C compiler. rand() is not necessarily a great generator and I make no claims for it. On Unix machines you may wish to use drand48() which may have better properties. This could be done by replacing the last 10 lines of the source file (which contain the code for ran1() ) by : double ran1() { double drand48(); return( drand48() ) ; } Or you may wish to use one of the portable random number generators described in Press, W. H., B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, 1988 Numerical Recipes in C. Cambridge University Press, Cambridge. If you retreive this program from my ftp directory, please let me know. If you successfully compile and run them, let me know. If you have troubles, also please let me know. My email address is rr-hudson@uchicago.edu .