polyRAD
polyRAD copied to clipboard
Export to and import from updog
It would be nice to have some convenience functions to convert between a RADdata
object and the input and output of updog
. This would allow users to take advantage of the file import and export options in polyRAD, while performing the genotype calling itself in updog (more accurate than polyRAD in some cases but much slower).
If you would like to add this feature and make a pull request, just comment here and I will give any help and guidance that I can. In particular see the multidog
and format_multidog
functions in updog. See also the checklist for pull requests.
@lvclark can I work on this issue?
@nk183 Sure, thanks for doing this!
Going from polyRAD
to updog
- I recommend outputting a named list with names
refmat
,sizemat
, andploidy
, to make it clear how the output can be passed to the arguments ofmultidog
. -
refmat
should come from the$alleleDepth
slot of theRADdata
object. You can use theOneAllelePerMarker
function withcommonAlleles = TRUE
to remove the common allele of each locus to avoid duplicate computations inupdog
. You will also need to transpose the matrix so alleles are in rows and taxa are in columns. -
sizemat
can either be derived from$alleleDepth + $antiAlleleDepth
or from$locDepth
. It needs to have the same dimensions and names asrefmat
. -
ploidy
can come from$possiblePloidies
although there will probably need to be an argument to specify ploidy as well, sincepolyRAD
supports multiple ploidies andupdog
does not.
Going from updog
to polyRAD
multidog
outputs a list of two items. inddf
has most of the information that is needed:
- The
snp
andind
columns can be used to generate column names and row names, respectively, foralleleDepth
. - The
ref
andsize
columns can be used to generate the values foralleleDepth
. - There needs to be a way to indicate or determine which "SNPs" actually belong to one locus. This could be a user-provided vector as in
alleles2loc
, or the function could test if adjacent SNPs have identicalsize
across all individuals. This is needed not only to generate thealleles2loc
vector, but also to determine if thealleleDepth
matrix needs to be padded out with alternate alleles. - The user will have to supply allele sequences for the
alleleNucleotides
slot. - A minimal
locTable
can be generated with locus names as the row names. An option for the user to add in chromosome and position would be helpful. -
possiblePloidies
can be generated from the ploidy listed insnpdf
. It needs to be a list even though it only contains one item. -
contamRate
can be taken frommedian(multidog_out$snpdf$seq)
. - After initialization of the
RADdata
object, the$posteriorProb
slot should be added. It should be a list of length 1 containing a named three-dimensional array with values from thePr_k
columns ofinddf
. (You can run through some of the tutorials or examples to see what aRADdata
object should look like after genotype calling.) - The
$possiblePloidies
slot should be copied to the$priorProbPloidies
slot.
Misc
Documentation of the RADdata
class: https://github.com/lvclark/polyRAD/wiki/RADdata
From your profile I'm not sure if you're new to R... If it is a new language to you, be aware that loops are very slow because the code gets reinterpreted on each iteration. Many functions and operations can process an entire vector/matrix/array at once, however.
If you are new to bioinformatics, what we're trying to accomplish is genotype calling, which in a diploid is basically determining whether an individual is AA, Aa, or aa at a particular site (AKA locus/marker/SNP/gene) in the genome. What we have is a random sample of DNA sequence, where the locus has usually been sequenced multiple times. The "read depth" is the number of times we see the sequence for a given allele or a given locus. Using that read depth, along with information about the population of individuals being studied, we can use Bayesian statistics to get a posterior probability of each genotype AA, Aa, and aa being the true genotype. In polyploids it is more complicated, for example in a tetraploid you could have AAAA, AAAa, AAaa, Aaaa, or aaaa.
Updog only supports two alleles per locus. polyRAD supports any number of alleles per locus, but treats them as "pseudo-biallelic", where each allele is treated as a marker and each read either belongs to that allele or does not. Hence multiple markers in updog might correspond to a single marker in polyRAD.