polyRAD Export to and import from updog

It would be nice to have some convenience functions to convert between a RADdata object and the input and output of updog. This would allow users to take advantage of the file import and export options in polyRAD, while performing the genotype calling itself in updog (more accurate than polyRAD in some cases but much slower).

If you would like to add this feature and make a pull request, just comment here and I will give any help and guidance that I can. In particular see the multidog and format_multidog functions in updog. See also the checklist for pull requests.

Oct 15 '20 19:10 lvclark

@lvclark can I work on this issue?

Oct 16 '20 09:10 nk183

@nk183 Sure, thanks for doing this!

Going from polyRAD to updog

I recommend outputting a named list with names refmat, sizemat, and ploidy, to make it clear how the output can be passed to the arguments of multidog.
refmat should come from the $alleleDepth slot of the RADdata object. You can use the OneAllelePerMarker function with commonAlleles = TRUE to remove the common allele of each locus to avoid duplicate computations in updog. You will also need to transpose the matrix so alleles are in rows and taxa are in columns.
sizemat can either be derived from $alleleDepth + $antiAlleleDepth or from $locDepth. It needs to have the same dimensions and names as refmat.
ploidy can come from $possiblePloidies although there will probably need to be an argument to specify ploidy as well, since polyRAD supports multiple ploidies and updog does not.

Going from updog to polyRAD

multidog outputs a list of two items. inddf has most of the information that is needed:

The snp and ind columns can be used to generate column names and row names, respectively, for alleleDepth.
The ref and size columns can be used to generate the values for alleleDepth.
There needs to be a way to indicate or determine which "SNPs" actually belong to one locus. This could be a user-provided vector as in alleles2loc, or the function could test if adjacent SNPs have identical size across all individuals. This is needed not only to generate the alleles2loc vector, but also to determine if the alleleDepth matrix needs to be padded out with alternate alleles.
The user will have to supply allele sequences for the alleleNucleotides slot.
A minimal locTable can be generated with locus names as the row names. An option for the user to add in chromosome and position would be helpful.
possiblePloidies can be generated from the ploidy listed in snpdf. It needs to be a list even though it only contains one item.
contamRate can be taken from median(multidog_out$snpdf$seq).
After initialization of the RADdata object, the $posteriorProb slot should be added. It should be a list of length 1 containing a named three-dimensional array with values from the Pr_k columns of inddf. (You can run through some of the tutorials or examples to see what a RADdata object should look like after genotype calling.)
The $possiblePloidies slot should be copied to the $priorProbPloidies slot.

Misc

Documentation of the RADdata class: https://github.com/lvclark/polyRAD/wiki/RADdata

From your profile I'm not sure if you're new to R... If it is a new language to you, be aware that loops are very slow because the code gets reinterpreted on each iteration. Many functions and operations can process an entire vector/matrix/array at once, however.

If you are new to bioinformatics, what we're trying to accomplish is genotype calling, which in a diploid is basically determining whether an individual is AA, Aa, or aa at a particular site (AKA locus/marker/SNP/gene) in the genome. What we have is a random sample of DNA sequence, where the locus has usually been sequenced multiple times. The "read depth" is the number of times we see the sequence for a given allele or a given locus. Using that read depth, along with information about the population of individuals being studied, we can use Bayesian statistics to get a posterior probability of each genotype AA, Aa, and aa being the true genotype. In polyploids it is more complicated, for example in a tetraploid you could have AAAA, AAAa, AAaa, Aaaa, or aaaa.

Updog only supports two alleles per locus. polyRAD supports any number of alleles per locus, but treats them as "pseudo-biallelic", where each allele is treated as a marker and each read either belongs to that allele or does not. Hence multiple markers in updog might correspond to a single marker in polyRAD.

Oct 16 '20 15:10 lvclark

polyRAD polyRAD copied to clipboard

Export to and import from updog

polyRAD
polyRAD copied to clipboard