Once RAPID has been installed, it can easily be used to do
non-parametric linkage analysis using the S_pairs sharing statistic,
or to assess the amount of sharing between all pairs of specified
individuals for each marker. For both of these types of analyses RAPID
must know the condensed identity coefficients of all quasi founders
and their descendants. Below, I will first explain the basic inputs to
RAPID that are necessary for all types of analyses. Second, I show how
to use RAPID to find the quasi founders and their descendants. Third,
I explain how to find the number of alleles shared for each pair of
sample individuals at each marker. Fourth, I give instructions on how
to do an affecteds-only (ie non-parametric) linkage analysis.

Basic Usage
===========

RAPID is run from the command line via the command 'rapid', with all
information, including the type of analysis, specified by command line
options. The first set of options is required every time 'rapid' is
executed; these options name the input and output files:

-p pedigree file
-g genotype file
-f frequency file
-s study sample file
-d identity coefficient file
-o output file

If any of these command line arguments is not present, the program
will request it from the user. Because of this it is also possible
to store the file names in a file and use input redirection to supply
the information. In this case, the file names must appear in the order
listed above (order is irrelevant, though, when using command line
arguments). So, for instance, you could enter:

rapid < analysis1.input

where analysis1.input is a file with contents:

bigped.ped
bigped.geno
ancestral_pop.freq
analysis1.sample
bigped.idcoefs
analysis1.output

Each of these must be a file in the proper format (see
file_formats). It is best not to mix command line arguments with
input redirection: either specify all the files with command line
arguments or list them all in a file.
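
As a quick check before running 'rapid' with input redirection, the
sketch below confirms that an input file holds exactly six non-empty
lines, one per prompt. The function name check_rapid_input is
hypothetical and not part of RAPID; it checks only the line count, not
whether the named files exist or are correctly formatted.

```shell
# check_rapid_input FILE (hypothetical helper, not part of RAPID):
# succeed only if FILE holds exactly six non-empty lines.
check_rapid_input() {
    # grep -c exits non-zero when nothing matches or the file is missing;
    # fall back to a count of 0 in that case.
    n=$(grep -c . "$1" 2>/dev/null) || n=${n:-0}
    if [ "$n" -ne 6 ]; then
        echo "$1: expected 6 file names, found $n" >&2
        return 1
    fi
}

# Recreate the example input file, in the order RAPID asks for the names:
# pedigree, genotype, frequency, study sample, identity coefficients, output.
cat > analysis1.input <<'EOF'
bigped.ped
bigped.geno
ancestral_pop.freq
analysis1.sample
bigped.idcoefs
analysis1.output
EOF

check_rapid_input analysis1.input && echo "analysis1.input lists six files"
```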

Unlike typical unix commands, 'rapid' requires a space between each
flag and its file name (i.e. it must appear as "-p pedfile" and not
"-ppedfile"). In addition, the argument '-h' will print out a list of
all the available arguments.

Step 1: Find quasi founders
===========================

After creating the pedigree, genotype and study sample files
(see file_formats for their description), you will need to create the
file with identity coefficients. RAPID does not compute these
coefficients, but it can figure out the set of individuals for whom
these coefficients are needed. Once this list is created you can run
the IdCoefs software package to compute these coefficients (available
from http://www.genes.uchicago.edu/abney.html, or do a Google
search). When running RAPID to find this list, the program still needs
you to input a frequency file and identity coefficient file, even
though these files are ignored (ie these files can be empty, or
non-existent as long as you enter something to the program). To find
the list of quasi founders and descendants, use the '-q' command line
option. The output from this command is in the specified output file
with a '.clist' extension and has a list of all the IDs for which
identity coefficients are needed. So, if you have the file
'analysis1.input', with contents as specified in the previous section,
and the IdCoefs package already installed, you would type

rapid -q < analysis1.input
idcoefs -p bigped.ped -s analysis1.output.clist -o bigped.idcoefs -r 500

This will create the file 'bigped.idcoefs' with all the necessary
identity coefficients and can be used directly in RAPID. The number
500 specifies the maximum amount of RAM in MB available for
computation by idcoefs. In practice, this number will depend on the
amount of RAM on your computer and the size of the pedigree. For
pedigrees of a couple of hundred individuals 500 MB of RAM is probably
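sufficient (assuming your computer has this amount available). Also
created by the first of the two commands is a file called
'quasifounders', which lists the IDs of the quasi founders (see
"Additional input and output files" below).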
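
Before running idcoefs it can be worth confirming that the '-q' step
actually produced a non-empty '.clist' file. A minimal sketch, assuming
the list has one ID per line (the exact layout is not specified here)
and using a hypothetical helper name:

```shell
# count_clist_ids OUTPUT_NAME (hypothetical helper, not part of RAPID):
# report how many non-empty lines are in OUTPUT_NAME.clist, or fail if
# that file is missing or empty.
count_clist_ids() {
    clist="$1.clist"
    if [ ! -s "$clist" ]; then
        echo "$clist is missing or empty; run 'rapid -q' first" >&2
        return 1
    fi
    echo "$clist: $(grep -c . "$clist") lines"
}

# Guarded call, so the sketch is harmless when no .clist file exists yet.
if [ -e analysis1.output.clist ]; then
    count_clist_ids analysis1.output
fi
```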

Step 2: Run analyses
====================

There are two types of analyses that can currently be run with RAPID:
estimating the number of alleles shared in pairs of study sample
individuals, and performing non-parametric linkage analysis in the study
sample individuals using the S_pairs statistic. In both types of
analyses, specifying accurate allele frequencies for the markers is
important. If your frequency estimates are from the same genotyped set of
individuals as you are including in your RAPID analysis (eg from the
genotypes in bigped.geno in the analysis1.input example above), DO NOT
USE THESE ESTIMATES IN RAPID! The reason is, if you provide the
frequencies, RAPID assumes those are unbiased estimates of the true
frequencies in the founder population and will not correct the S_pairs
statistic for bias. Experience suggests that this will tend to have a
negligible effect on the number of alleles shared for any particular
pair, but can have a large effect on S_pairs (because it sums over
many pairs). If you feel you have decent estimates of the founding
population allele frequencies, either from the population itself or
from a reasonable proxy for the population, then you are probably
better off using these estimates instead of the ones that RAPID will
compute. To have RAPID estimate the allele frequencies rather than
read them in from a file, enter the number '0' for the frequency file
(eg replace 'ancestral_pop.freq' with '0' in the 'analysis1.input'
file). 
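
If you keep the file names in an input file, switching to
RAPID-estimated frequencies only requires changing the third line. A
minimal sketch using sed; the name analysis1_estfreq.input is a
hypothetical choice, and the input file is recreated here only so the
example is self-contained:

```shell
# Recreate the six-line example input file (one file name per line).
printf '%s\n' bigped.ped bigped.geno ancestral_pop.freq \
    analysis1.sample bigped.idcoefs analysis1.output > analysis1.input

# Line 3 is the frequency file; writing 0 there asks RAPID to estimate
# the allele frequencies. The original file is left unchanged.
sed '3s/.*/0/' analysis1.input > analysis1_estfreq.input
cat analysis1_estfreq.input
```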

2a: Estimate number of alleles shared IBD
=========================================

To estimate the number of alleles shared IBD for the pairs specified
in your study sample use the '-k' command line option. So, if you want
RAPID to estimate the allele frequencies:

rapid -k -p bigped.ped -s analysis1.sample -g bigped.geno -f 0 -d bigped.idcoefs -o analysis1.output

or,

rapid -k < analysis1.input

where analysis1.input would have a '0' on its third line instead of a
frequency file name. If you have good allele frequency estimates, put
the name of the frequency file in place of the '0' either following
the '-f' option or in 'analysis1.input', depending on which form you
use. Note that when doing this analysis, RAPID will not attempt to
correct for bias when it estimates the allele frequencies since this
effect will normally be very small for each pair.

Once RAPID has finished you will find a file named
'analysis1.output.nshared' which has a row for each pair of
individuals and a column for each marker in the genotype file with the
estimated number of alleles shared IBD at that marker. If the pair
consists of two distinct individuals the number shared will be a real
number in the range [0, 4]. The correct interpretation of this number
is the expected value of S_pairs, for this pair, given the genotype
data at this marker. If the individuals in the pair are the same
individual, the number shared is in the range [0, 1].

2b: Non-parametric linkage analysis
===================================

If neither the '-q' nor the '-k' options are specified, then linkage
analysis using the S_pairs statistic will be done. There are two
different ways of obtaining p-values when doing this. The first method
is fast, but only gives approximate (and sometimes extremely
conservative) p-values; the second method takes much longer, but gives
fully empiric p-values. If there are many markers to analyze, I
recommend first finding approximate p-values for all markers, then
selecting the 5-10 (or more depending on how long it takes and how
much time you want to spend) best markers for which to do a fully
empiric p-value analysis. In both cases, you will need to use the '-u'
command line option to state the number of simulations to be done.

To obtain approximate p-values, RAPID first computes the distribution
of S_pairs assuming a perfectly informative marker based on N1
simulations and scales this to have a variance of 1.0. To get an
approximate p-value for a marker, the S_pairs statistic at that marker
is computed and then, based on a limited number of simulations N2, is
adjusted for bias and scaled such that the statistic also has a
variance of 1.0. The adjusted value of the statistic is then compared
to the scaled, perfectly informative marker distribution to get an
approximate p-value. The values of N1 and N2 are controlled with the
'-u' command line option. To set N1 alone use '-u N1'; N2 is then set to
either 50 or N1, whichever is smaller. To set both N1 and N2 use
'-u N1,N2'. Note that N2 cannot be set larger than N1. Probably a
reasonable choice of N1 for a genomewide scan would be 100,000. In
this case you would enter the command:

rapid -u 100000 < analysis1.input

or

rapid -u 100000 -p bigped.ped -s analysis1.sample -g bigped.geno -f 0 -d bigped.idcoefs -o analysis1.output

The results of the analysis using the above command would be in
'analysis1.output'. In this file there is one row for each marker with
an initial header line labeling each of the seven columns:
column 1: The name of the marker
column 2: The approximate p-value of that marker
column 3: The bias adjusted and scaled value of S_pairs (ie S_tilde)
column 4: The bias adjusted value of S_pairs (ie b_spairs)
column 5: The estimate of the bias (from N2 simulations)
column 6: The scale factor to give a variance of 1.0 (from N2 simulations)
column 7: S_pairs before adjusting for bias and scaling
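
To pick out the best markers from this file, one option is to sort on
the second column. The sketch below assumes whitespace-separated
columns and exactly one header line, and relies on the '-g' (general
numeric) option of GNU and BSD sort so that values such as 1e-05 sort
correctly; top_markers is a hypothetical helper, not part of RAPID.

```shell
# top_markers FILE N (hypothetical helper): print the marker name and
# approximate p-value of the N markers with the smallest p-values.
top_markers() {
    # Drop the header line, sort numerically on column 2, keep N rows.
    tail -n +2 "$1" | LC_ALL=C sort -g -k2,2 | head -n "$2" |
        awk '{ print $1, $2 }'
}

# Guarded call, so the sketch is harmless if no results file exists yet.
if [ -f analysis1.output ]; then
    top_markers analysis1.output 10
fi
```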

After obtaining the approximate p-values of all the markers you should
select some number of best markers for which to get more accurate,
empirical p-value estimates. To do that, create a new genotype file
(e.g. best_markers.geno) and rerun the analysis using the '-e' command
line option. You will also need to use '-u N1' to set the number of
empirical simulations that should be used to estimate the p-value. A
reasonable rule of thumb would be to set N1 to ten times the inverse
of the smallest p-value you want an accurate estimate of. So, for
instance, if you want accurate estimates of p-values that are as small as
1e-5, use '-u 1000000'. You should be aware that to speed up
computation, RAPID saves all the simulations in memory. If you have
some combination of a large pedigree, very many simulations, and not
much RAM, your computer may run out of memory. If this happens you may
need to satisfy yourself with fewer simulations (or get more RAM). 
When computing the empiric p-value, RAPID will also compute an
approximate p-value, but this time based on N1 simulations (even if you
put in a separate value of N2, e.g. '-u 1000000,100'). So, to run an
empiric analysis you could issue the command:

rapid -e -u 1000000 -p bigped.ped -s analysis1.sample -g best_markers.geno -f ancestral_pop.freq -d bigped.idcoefs -o best_markers.output

The above command assumes you have decent estimates of the ancestral
population allele frequencies. If you do not, use '-f 0' instead. The
results of the analysis using the above command can be found in the
file 'best_markers.output'. This file has one row for each marker in
'best_markers.geno' along with an initial header line that labels each
of the eight columns:
column 1: The marker name
column 2: The empiric p-value estimated from N1 simulations
column 3: The approximate p-value of that marker
column 4: The bias adjusted and scaled value of S_pairs (ie S_tilde)
column 5: The bias adjusted value of S_pairs (ie b_spairs)
column 6: The estimate of the bias (from N1 simulations)
column 7: The scale factor to give a variance of 1.0 (from N1 simulations)
column 8: S_pairs before adjusting for bias and scaling
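
The rule of thumb for choosing N1 (ten times the inverse of the
smallest p-value you want to estimate accurately) is simple
arithmetic. A small sketch using awk; sims_for_pvalue is a hypothetical
helper name, not part of RAPID:

```shell
# sims_for_pvalue P (hypothetical helper): print 10 / P rounded to the
# nearest integer, as a suggested value for '-u' in an empiric analysis.
sims_for_pvalue() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", 10 / p }'
}

sims_for_pvalue 1e-5    # prints 1000000, matching '-u 1000000' above
```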

Additional command line options
===============================

-a  The '-a' command line option can be used to append the output to
    the file given following '-o'. In this case, the header line is
    not printed. This may be useful when accumulating many simulation
    analyses into a single file.
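
One way this might be used is to analyze several genotype files in
turn and collect the results in one output file: the first run writes
the header, and later runs add '-a'. The sketch below only prints the
commands rather than running them; the helper name and the chunk*.geno
and combined.output file names are hypothetical, and it assumes '-a'
can be combined with the other options shown earlier.

```shell
# print_append_runs OUTPUT GENO... (hypothetical helper): print one
# 'rapid' command per genotype file, adding '-a' from the second run on.
print_append_runs() {
    out=$1
    shift
    append=""
    for geno in "$@"; do
        # $append is deliberately unquoted so it vanishes when empty.
        echo rapid $append -u 100000 -p bigped.ped -s analysis1.sample \
            -g "$geno" -f 0 -d bigped.idcoefs -o "$out"
        append="-a"
    done
}

print_append_runs combined.output chunk1.geno chunk2.geno chunk3.geno
```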

Additional input and output files
=================================

In order to do simulations, a number must be supplied to the program
to seed the random number generator. Normally, the program will look
at the clock time supplied by the computer. If, however, a file named
'seed' is present, the program will read a number from this file and
use that to seed the generator.
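
For reproducible simulations you can create the 'seed' file yourself
before running 'rapid'. A minimal sketch, assuming the file is looked
for in the directory where 'rapid' is run and contains a single
integer; 12345 is an arbitrary example value:

```shell
# Write an arbitrary example seed to the file named 'seed', then show it.
echo 12345 > seed
cat seed
```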

It may occasionally be the case that a marker with a Mendelian
incompatibility is encountered. If this happens, RAPID will print an
error message to the screen. Also, the marker name and the ID number
of the individual around whom the error occurred are printed to a file
whose name has the suffix '.mend_err' appended to the specified output
file name. These markers are skipped and given an S_pairs value of
zero. 
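
After a run it is easy to check whether any markers were skipped. A
minimal sketch, assuming one line per reported error in the
'.mend_err' file (the exact layout is not specified here);
report_mend_err is a hypothetical helper, not part of RAPID:

```shell
# report_mend_err OUTPUT_NAME (hypothetical helper): say whether
# OUTPUT_NAME.mend_err exists and is non-empty, and how many lines it has.
report_mend_err() {
    errfile="$1.mend_err"
    if [ -s "$errfile" ]; then
        echo "$(grep -c . "$errfile") line(s) in $errfile"
    else
        echo "no Mendelian errors recorded for $1"
    fi
}

report_mend_err analysis1.output
```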

RAPID also creates a file with a name that is the output file name
with '.spairs' appended to the end. In this file you can find the
average value of the raw S_pairs (i.e. without bias correction or
scaling) for each pair, where the average is over all markers. In
addition, the kinship coefficient for each pair is given. If there
were no bias, this value should be close to zero. An excess or deficit
in genomewide sharing would be indicated by a positive or negative
value. The values in this file can give you a rough idea about the
overall sharing, but probably should not be used to do any rigorous
inferences.

When RAPID creates the list of individuals for whom identity
coefficients are needed it also creates a file called 'quasifounders'
which lists the IDs of the individuals who are quasi founders. This
information is not used for anything other than your own personal
edification.