Programs for carrying out permutation tests

[ Attention:  The version of snn.c available on this site  before 9/18/2000 has a bug
which resulted in incorrect results.  Ignore all results based on earlier
versions of snn.c.  The published results concerning power were based on a 
different version which is ok. ]

This directory contains software for carrying out permutation tests for 
detecting geographic subdivision when one has samples from two 
or more localities.  The two sample tests are those described in Hudson,
Boos and Kaplan 1992 MBE 9: 138-151, except that the chi-square tests with lumping
are not done by these programs.  Instead the permutation test based on a
chi-square statistic without lumping is done as suggested by Roff and 
Bentzen 1989 MBE 6: 539-45.  For more than two localities the extension of
the Roff and Bentzen test is obvious.  For the other tests the extension is
as follows. Recall that the tests for two localities was based on 
Ks = wK1 + (10w)k2, where w is a weighting that depends on the sample sizes from 
the two localities.  (This is for the test based on number of differences, the other 
tests will extend in an analogous fashion.)  For more localities, the test is based
on
   Ks = Sum( wi*Ki ),
where Ki is the mean distance between pairs of sequences from 
locality i, and where wi is a weighting factor based on the sample size.  wi are
calculated as follows  wi = (ni-wf)/(ntotal -wf*nlocs ), where wf is a parameter 
to be specified by the user. (Usually wf will be 0 or 2, see below.)  The 
p-value estimated by the program is the proportion of permuted samples with
Ks less than or equal to the observed value.  
	Also included in this directory is a program, snn.c, to carry out the
"nearest neighbor test" as described in Hudson (2000) Genetics 155: 2011-2014. 

Files to retrieve

 To carry out the tests described here, you
will need to retrieve all the *.c files, plus the files dotests_mat and
dotests_seq. The file "compilations" indicate how to compile and link
the programs. There are two sample data files, sample.dat and seq.dat.

 Two kinds of data files can be used for these tests.  The data can be
in the form of matrix of pairwise differences between all sampled genes. 
The genes from each locality being grouped in blocks in the matrix.  An example
of such a data file is sample.dat .  A more complete description of the 
data format is given below.  The other possible data format is a file with
the sequence of each sampled gene given.  The format for data of this
type is given in the source file seqtomatrix.c, which codes for program to
convert the list of sequences to a matrix of differences. seq.dat is an
example data file with sequence data.

Compiling and Linking

  The programs should be compiled and linked as indicated in the file
"compilation" in this directory.  Your machine may require some additional
flags or switches. I think the code is fairly standard and should port to 
most systems ok.  Some machines may require different header files ( e.g. 
malloc.h etc ).or not have some files ( e.g. ctype.h ).
flags or switches. 

Using the programs

   The program for doing the basic permutation test is
permtest. A list of command line arguments for this (and most the other
 programs) can be obtained by invoking them without arguments. Here is
the usage of permtest: 

permtest  stat wf y/n n_permuts n_locals n1 n2 ...<infile >outifle 

where
   stat: This is just the symbol that will be used in the output to 
indicate the kind of statistic that is being used. The appropriate
symbol depends on the distance measure that is used in the data file.
Thus the output will label the observed values of the statistics
with "stat"st   "stat"s and "stat"t, where the quoted part will be 
replaced by whatever was given as first argument.  If the data file
contains pairwise differences between sequences, then K is an appropriate
first argument.  It is important to note that the first argument has no
effect on the test, the observed values , or the p-values, it only determines
what label is printed along side the results.
   wf:  This is used in determining the weighting factor that is used 
to average the within locality measures of distance. The weighting factor
is the "w" that appears in equation (10) of Hudson, Boos and  Kaplan 
MBE 9:138-151 (1992).  For the statistic Ks, they recommend setting 
w = n1/(n1+n2).  For other statistics, they recommend w = (n1-2)/(n1+n2-4).
The program uses wi, for weighting the "within distance measure" in the ith
local (which has sample size ni), where wi = (ni-wf)/(ntotal -nloc*wf)  .  If
there are two localities and wf is set to zero this gives the weighting
recommended by Hudson et al for Ks.  If wf is set equal to two, then the
resulting weighting (for two locality case) gives the weighting recommended
Summarizing, use 0 for wf when testing with Ks, otherwise use 2.
  y/n:  A y here indicates that the input data matrix should be output.
  n_permuts: The number of permutations to carry out. Usually 1000 or more.
  n_locals: The number of localities from which the samples were taken.
  n1 n2 ... : The sample sizes from each of the sample localities.
  infile: The data file (with the matrix of differences.
  outfile: The name of the output file.

For example:
   permtest K 0 n 1000 2 20 27 <sample.dat >test.out

will permform the permutation test using the data matrix in sample.dat,
in which the elements of the matrix are number of differences between the
sampled sequences. Since the input consists of pairwise number of diffences
the resulting statistics are Ks, Kst and Kt, and hence the output will be 
so labeled since the first argument is K.  The second argument sets wf to 
zero, so the weighting will be ni/ntotal for the ith local.  The third
argument "n" instructs  the program to refrain from outputting the input
data matrix.  The number of permutations carried out is 1000, as specified
by the fourth argument.  The number of localities sampled is 2, and 20
sequences were from locality one and 27 were from locality two. The results
were directed to the file test.out.  The contents of file test.out
should look like:
newt.rhudson{415} cat test.out

Sample configuration: 20  27
Number of permutations: 1000  weighting constant: 0
Observed values of statistics:
 Kst: 0.001133 , Ks: 7.161168  Kt: 7.169288 ( p-value: 0.363000)

newt.rhudson{416} 

Other tests can be performed by transforming the pairwise difference 
matrix in various ways and then using permtest.  For example

logtrans <sample.dat | permtest 'K*' 2 n 1000 2 20 27 >test.out

will first transform the data matrix into a matrix which contains
elements equal to the log(1+dij), where dij is the number of differences
between sequence i and sequence j.  The statistics based on this measure
of distance between sequences were designated K*s, etc in Hudson, et
al, (see equation (11) ) , and hence it is helpful to label the output
as such. Hence the first argument is given as 'K*'.  (The single quotes
may be necessary so that the shell doesn't do anything weird with the
asterisk.) 

The file dotests_mat  will carry out most of the tests described by 
Hudson et al, by transforming the data matrix in the appropriate ways.
These programs will not do the chi-square tests with lumping of 
classes.  Here is an example of using dotests_mat:
   dotests_mat sample.dat 1000 2 20 27 > tests.out

The program permchi carries out the permutation test of Roff and Bentzen, 
using the chi-square statistic, without any lumping of types.  This test
will also be carried out by the dotests_m program. The chi-square tests
with lumping to get minimum expected numbers are not carried out by these
programs.

The program seqtomatrix will convert data files with sequences (in the
appropriate format) to a file with a matrix of pairwise differences.
For example,

seqtomatrix 47 41 <seq.dat >mat.dat

will use the 47 sequences (each 41 sites long) to create a pairwise
difference matrix in the file mat.dat, which can be used as input to 
the permutation test programs.  The shell program in dotests_seq will
utilize a file of sequences to do the same tests as dotests_mat .


Data format:
	
The data required is a matrix of number of pairwise differences between 
all sampled copies of the gene. 
The format required is the default 
format generated by MEGA  except the first line should read:
   Number of sequences nn
where nn is the number of sequences. (See the sample.dat file for an example.)
 The matrix should be upper 
triangular. If n1 denotes the sample size from locality one, and n2 the 
sample size from locality two, the first  n1 rows of the matrix are 
assumed to give distances between locality one genes and  other genes. ( 
In other words the element d(i,j) of the matrix gives the number of 
differences between sequence i and sequence j, where sequence i is from 
locality  one if i less than or equal to n1, and is from locality two if 
i is greater than n1.)  The program will ignore all lines in the file 
until the first 4 characters on a line are OTUs . (The program is case 
sensitive.)  On the same line after the word "OTUs"  there must follow n 
blank-space-separated words (MEGA does this, and the words are in this 
case just the number labels for the OTUs ). (n is the total sample size, 
except if the matrix is split in which case n is the number of columns 
displayed.)  Then the following lines must each start with a word (an OTU 
label) followed by the row of  matrix elements ( remember just the upper 
triangular part of the matrix.) 
	
 An example data file:
	
Number of sequences 7 
Input File Name: C:\MEGA\PLORG12.MEG  on  Thu Nov 03 11:12:14 1994

Input Data: DNA
No. of Otus Used: 7  of 14

Distance:  Number of different nucleotides.
No. of Nucleotides in Subset: 1530 of 1530
Gap Sites and Missing Information Data: All such sites were removed from 
the subset data

 OTU Labels
   1..  Tpa
   2..  Bbo
   3..  Bca
   4..  Tta
   5..  Cwr
   6..  Nca
   7..  Perk

Distances in the upper-right matrix
'*' indicates an invalid distance value

OTUs     1         2         3         4         5         6         7         
  1            129.0000   66.0000    1.0000  105.0000   86.0000  115.0000  
  2                      122.0000  129.0000  180.0000  167.0000  195.0000  
  3                                 65.0000  125.0000  107.0000  135.0000  
  4                                          106.0000   85.0000  115.0000  
  5                                                     98.0000  113.0000  
  6                                                              116.0000  
  7                                                                        
	
 
All that is really needed by permtest are the last 8 lines.  The split 
matrix format that MEGA produces when the page width is less than is 
needed to print the matrix will also work as input to permtest.


Random number generation

   The last 10 lines or so of the program contain the subroutine ran1().  
This subroutine generates a pseudorandom uniform on the interval 0,1 by 
calling the function rand(), which is often in the standard library of C 
subroutines provided with a C compiler.  rand() is not necessarily a 
great generator and I make no claims for it.  On Unix machines you may 
wish to use drand48() which may have better properties. This could be 
done by replacing the last 10 lines of the source file (which contain the 
code for ran1() ) by : 

   double
ran1()
{
double drand48();
    return( drand48() ) ;
} 

Or you may wish to use one of the portable random number generators 
described in 
	
Press, W. H., B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, 1988   
Numerical Recipes in C.  Cambridge University Press, Cambridge.


If you retreive this program from my ftp directory, please let me know.  If
you successfully compile and run them, let me know.  If you have troubles, 
also please let me know.  My email address is rr-hudson@uchicago.edu  .