Once RAPID has been installed, it can be used to do non-parametric linkage analysis using the S_pairs sharing statistic, or to assess the amount of sharing between all pairs of specified individuals at each marker. For both types of analysis RAPID must know the condensed identity coefficients of all quasi founders and their descendants. Below, I first explain the basic inputs to RAPID that are necessary for all types of analyses. Second, I show how to use RAPID to find the quasi founders and their descendants. Third, I explain how to find the number of alleles shared for each pair of sample individuals at each marker. Fourth, I give instructions on how to do an affecteds-only (i.e. non-parametric) linkage analysis.

Basic Usage
===========

RAPID is run from the command line via the command 'rapid' with all information, including the type of analysis, specified by command line options. The first set of options is required every time 'rapid' is executed and specifies the various input and output files. These options are:

    -p  pedigree file
    -g  genotype file
    -f  frequency file
    -s  study sample file
    -d  identity coefficient file
    -o  output file

If any of these command line arguments is not present, the program will request it from the user. Because of this, it is also possible to store the file names in a file and use input redirection to supply the information. In this case, the file names must appear in the order listed above (order is irrelevant, though, when using command line arguments). So, for instance, you could enter:

    rapid < analysis1.input

where analysis1.input is a file with contents:

    bigped.ped
    bigped.geno
    ancestral_pop.freq
    analysis1_sample
    bigped.idcoefs
    analysis1.output

and each of these is a file with the proper format (see file_formats).

Note that it is best not to mix and match these command line arguments with input redirection. Either specify all the files with command line arguments or all in a file.
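Because the order of the names matters when input redirection is used, it can help to generate the input file from labeled entries rather than typing it by hand. The following is only a minimal sketch in Python: the dictionary keys and the helper name 'write_input_file' are hypothetical and are not part of RAPID; only the order of the entries (the order of the option list above) and the example file names come from this document.

```python
from pathlib import Path

# Order in which 'rapid' expects the file names on standard input,
# i.e. the order of the option list above (-p, -g, -f, -s, -d, -o).
# The key names themselves are hypothetical labels for this sketch.
PROMPT_ORDER = ["pedigree", "genotype", "frequency", "sample", "idcoefs", "output"]


def write_input_file(path, files):
    """Write one file name per line, in the order RAPID reads them."""
    missing = [key for key in PROMPT_ORDER if key not in files]
    if missing:
        raise ValueError(f"missing entries: {missing}")
    lines = [str(files[key]) for key in PROMPT_ORDER]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


analysis1 = {
    "pedigree": "bigped.ped",
    "genotype": "bigped.geno",
    "frequency": "ancestral_pop.freq",
    "sample": "analysis1_sample",
    "idcoefs": "bigped.idcoefs",
    "output": "analysis1.output",
}
write_input_file("analysis1.input", analysis1)
```

The resulting file can then be supplied with 'rapid < analysis1.input'. Keeping the entries labeled makes it harder to swap two lines by accident, which RAPID would otherwise have no way to detect from the order alone.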
Unlike typical Unix commands, the command line argument must have a space between the flag and the file name (i.e. it must appear as "-p pedfile" and not "-ppedfile"). In addition, the argument '-h' will print out a list of all the available arguments.

Step 1: Find quasi founders
===========================

After creating the pedigree, genotype and study sample files (see file_formats for their description), you will need to create the file with identity coefficients. RAPID does not compute these coefficients, but it can determine the set of individuals for whom these coefficients are needed. Once this list is created you can run the IdCoefs software package to compute the coefficients (available from http://www.genes.uchicago.edu/abney.html, or do a Google search). When running RAPID to find this list, the program still needs you to input a frequency file and an identity coefficient file, even though these files are ignored (i.e. these files can be empty, or non-existent, as long as you enter something to the program).

To find the list of quasi founders and descendants, use the '-q' command line option. The output from this command is written to a file named after the specified output file with a '.clist' extension, and contains a list of all the IDs for which identity coefficients are needed. So, if you have the file 'analysis1.input', with contents as specified in the previous section, and the IdCoefs package already installed, you would type

    rapid -q < analysis1.input
    idcoefs -p bigped.ped -s analysis1.output.clist -o bigped.idcoefs -r 500

This will create the file 'bigped.idcoefs' with all the necessary identity coefficients, which can be used directly in RAPID. The number 500 specifies the maximum amount of RAM in MB available for computation by idcoefs. In practice, this number will depend on the amount of RAM on your computer and the size of the pedigree.
For pedigrees of a couple of hundred individuals 500 MB of RAM is probably sufficient (assuming your computer has this amount available). Also created by the first of the two

Step 2: Run analyses
====================

There are two types of analyses that can currently be run with RAPID: estimating the number of alleles shared in pairs of study sample individuals, and performing non-parametric linkage analysis in the study sample individuals using the S_pairs statistic. In both types of analysis, specifying accurate allele frequencies for the markers is important. If your frequency estimates are from the same genotyped set of individuals as you are including in your RAPID analysis (e.g. from the genotypes in bigped.geno in the analysis1.input example above), DO NOT USE THESE ESTIMATES IN RAPID! The reason is that, if you provide the frequencies, RAPID assumes they are unbiased estimates of the true frequencies in the founder population and will not correct the S_pairs statistic for bias. Experience suggests that this will tend to have a negligible effect on the number of alleles shared for any particular pair, but can have a large effect on S_pairs (because it sums over many pairs). If you feel you have decent estimates of the founding population allele frequencies, either from the population itself or from a reasonable proxy for the population, then you are probably better off using these estimates instead of the ones that RAPID will compute.

To have RAPID estimate the allele frequencies rather than read them in from a file, enter the number '0' for the frequency file (e.g. replace 'ancestral_pop.freq' with '0' in the 'analysis1.input' file).

2a: Estimate number of alleles shared IBD
=========================================

To estimate the number of alleles shared IBD for the pairs specified in your study sample, use the '-k' command line option.
So, if you want RAPID to estimate the allele frequencies:

    rapid -k -p bigped.ped -s analysis1.sample -g bigped.geno -f 0 -d bigped.idcoefs -o analysis1.output

or,

    rapid -k < analysis1.input

where analysis1.input would have a '0' on its third line instead of a frequency file name. If you have good allele frequency estimates, put the name of the frequency file in place of the '0', either following the '-f' option or in 'analysis1.input', depending on which form you use. Note that when doing this analysis, RAPID will not attempt to correct for bias when it estimates the allele frequencies, since this effect will normally be very small for each pair.

Once RAPID has finished you will find a file named 'analysis1.output.nshared', which has a row for each pair of individuals and a column for each marker in the genotype file, containing the estimated number of alleles shared IBD at that marker. If the pair consists of two distinct individuals, the number shared will be a real number in the range [0, 4]. The correct interpretation of this number is the expected value of S_pairs, for this pair, given the genotype data at this marker. If the individuals in the pair are the same individual, the number shared is in the range [0, 1].

2b: Non-parametric linkage analysis
===================================

If neither the '-q' nor the '-k' option is specified, then linkage analysis using the S_pairs statistic will be done. There are two different ways of obtaining p-values when doing this. The first method is fast, but only gives approximate (and sometimes extremely conservative) p-values; the second method takes much longer, but gives fully empiric p-values. If there are many markers to analyze, I recommend first finding approximate p-values for all markers, then selecting the 5-10 best markers (or more, depending on how long it takes and how much time you want to spend) for which to do a fully empiric p-value analysis.
In both cases, you will need to use the '-u' command line option to state the number of simulations to be done.

To obtain approximate p-values, RAPID first computes the distribution of S_pairs assuming a perfectly informative marker, based on N1 simulations, and scales this to have a variance of 1.0. To get an approximate p-value for a marker, the S_pairs statistic at that marker is computed and then, based on a limited number of simulations N2, is adjusted for bias and scaled so that the statistic also has a variance of 1.0. The adjusted value of the statistic is then compared to the scaled, perfectly informative marker distribution to get an approximate p-value. The values of N1 and N2 are controlled with the '-u' command line option. To set N1 use '-u N1'; N2 is then set to either 50 or N1, whichever is smaller. To set both N1 and N2 use '-u N1,N2'. Note that N2 cannot be set larger than N1. A reasonable choice of N1 for a genomewide scan would probably be 100,000. In this case you would enter the command:

    rapid -u 100000 < analysis1.input

or

    rapid -u 100000 -p bigped.ped -s analysis1.sample -g bigped.geno -f 0 -d bigped.idcoefs -o analysis1.output

The results of the analysis using the above command would be in 'analysis1.output'. In this file there is one row for each marker, with an initial header line labeling each of the seven columns:

    column 1: The name of the marker
    column 2: The approximate p-value of that marker
    column 3: The bias adjusted and scaled value of S_pairs (i.e. S_tilde)
    column 4: The bias adjusted value of S_pairs (i.e. b_spairs)
    column 5: The estimate of the bias (from N2 simulations)
    column 6: The scale factor to give a variance of 1.0 (from N2 simulations)
    column 7: S_pairs before adjusting for bias and scaling

After obtaining the approximate p-values of all the markers, you should select some number of best markers for which to get more accurate, empirical p-value estimates. To do that, create a new genotype file (e.g.
best_markers.geno) and rerun the analysis using the '-e' command line option. You will also need to use '-u N1' to set the number of empirical simulations that should be used to estimate the p-value. A reasonable rule of thumb is to set N1 to ten times the inverse of the smallest p-value you want an accurate estimate of. So, for instance, if you want accurate estimates of p-values that are as small as 1e-5, use '-u 1000000'.

You should be aware that, to speed up computation, RAPID saves all the simulations in memory. If you have some combination of a large pedigree, very many simulations, and not much RAM, your computer may run out of memory. If this happens you may need to settle for fewer simulations (or get more RAM).

When computing the empiric p-value, RAPID will also compute an approximate p-value, but this time based on N1 simulations (even if you put in a separate value of N2, e.g. '-u 1000000,100'). So, to run an empiric analysis you could issue the command:

    rapid -e -u 1000000 -p bigped.ped -s analysis1.sample -g best_markers.geno -f ancestral_pop.freq -d bigped.idcoefs -o best_markers.output

The above command assumes you have decent estimates of the ancestral population allele frequencies. If you do not, use '-f 0' instead. The results of the analysis using the above command can be found in the file 'best_markers.output'.
This file has one row for each marker in 'best_markers.geno' along with an initial header line that labels each of the eight columns:

    column 1: The marker name
    column 2: The empiric p-value estimated from N1 simulations
    column 3: The approximate p-value of that marker
    column 4: The bias adjusted and scaled value of S_pairs (i.e. S_tilde)
    column 5: The bias adjusted value of S_pairs (i.e. b_spairs)
    column 6: The estimate of the bias (from N1 simulations)
    column 7: The scale factor to give a variance of 1.0 (from N1 simulations)
    column 8: S_pairs before adjusting for bias and scaling

Additional command line options
===============================

    -a  The '-a' command line option can be used to append the output to the
        file given following '-o'. In this case, the header line is not
        printed. This may be useful when accumulating many simulation
        analyses into a single file.

Additional input and output files
=================================

In order to do simulations, a number must be supplied to the program to seed the random number generator. Normally, the program will look at the clock time supplied by the computer. If, however, a file named 'seed' is present, the program will read a number from this file and use that to seed the generator.

It may occasionally be the case that a marker with a Mendelian incompatibility is encountered. If this happens, RAPID will print an error message to the screen. Also, the marker name and the ID number of the individual around whom the error occurred are printed to a file whose name has the suffix '.mend_err' appended to the specified output file name. These markers are skipped and given an S_pairs value of zero.

RAPID also creates a file with a name that is the output file name with '.spairs' appended to the end. In this file you can find the average value of the raw S_pairs (i.e. without bias correction or scaling) for each pair, where the average is over all markers. In addition, the kinship coefficient for each pair is given.
If there were no bias, this value should be close to zero. An excess or deficit in genomewide sharing would be indicated by a positive or negative value, respectively. The values in this file can give you a rough idea about the overall sharing, but probably should not be used to draw any rigorous inferences.

When RAPID creates the list of individuals for whom identity coefficients are needed, it also creates a file called 'quasifounders', which lists the IDs of the individuals who are quasi founders. This information is not used for anything other than your own personal edification.
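
As a closing illustration of the two-stage procedure in Step 2b, the sketch below shows, in Python, how the '-u' rules and the marker selection might be scripted. It is not part of RAPID and makes several assumptions: that the columns of the approximate p-value output are separated by whitespace, that marker names contain no spaces, and that the first line is the header. All function names, the header text, and the example rows are hypothetical and invented purely for illustration.

```python
def effective_simulations(u_value):
    """Interpret a '-u' value as described in Step 2b (approximate analysis).

    'N1' alone gives N2 = min(50, N1); 'N1,N2' sets both, with N2 <= N1.
    """
    parts = [int(part) for part in u_value.split(",")]
    n1 = parts[0]
    n2 = parts[1] if len(parts) > 1 else min(50, n1)
    if n2 > n1:
        raise ValueError("N2 cannot be set larger than N1")
    return n1, n2


def simulations_for(smallest_p, factor=10):
    """Rule of thumb: N1 is ten times the inverse of the smallest p-value."""
    return int(round(factor / smallest_p))


def best_markers(lines, n_best=10):
    """Return the names of the n_best markers with the smallest approximate p-value.

    Assumes whitespace-separated columns: marker name first, p-value second.
    """
    rows = []
    for line in lines[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        rows.append((float(fields[1]), fields[0]))
    rows.sort()
    return [name for _, name in rows[:n_best]]


# Invented rows in the seven-column layout; not real RAPID output.
example = [
    "marker p_approx S_tilde b_spairs bias scale S_pairs",
    "M1 0.004 2.9 35.1 1.2 0.08 36.3",
    "M2 0.31 0.5 30.2 1.1 0.09 31.3",
    "M3 0.0002 3.8 37.4 1.3 0.08 38.7",
]

print(effective_simulations("100000"))  # (N1, N2) for '-u 100000'
print(simulations_for(1e-5))            # N1 for p-values as small as 1e-5
print(best_markers(example, n_best=2))  # markers to copy into best_markers.geno
```

With real output you would read the lines from 'analysis1.output' instead of the invented list. Extracting the selected markers into 'best_markers.geno' depends on the genotype file format (see file_formats) and is not attempted here.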