polyRAD icon indicating copy to clipboard operation
polyRAD copied to clipboard

Export to additional population genetics formats

Open lvclark opened this issue 3 years ago • 3 comments

Overview

There has been some interest in using genotypes called by polyRAD for population genetics analysis. Two pieces of software that have long supported polyploid genotypes are Structure and SPAGeDi. Although they were originally designed for relatively small sets of PCR markers, they both scale up to large sequence-based datasets. They each use a custom format, and it would be nice if polyRAD exported to those formats.

I may eventually get to writing these export functions. If you would like to see them (or others), feel free to comment here! If you have some R experience and would like to write the functions yourself, see below. First-time contributors are welcome!

Contributing

You will need to make your own fork of polyRAD, clone that fork to your computer, commit your changes, push those commits to GitHub, and open a pull request. See the Github documentation if you need help understanding those tasks.

R code

Name the function Export_Structure or Export_SPAGeDi, as appropriate. Add it to the file R/data_export.R.

The first argument for the function should be a RADdata object named object. There should also be an argument called file indicating the path to write to. Another possible argument might indicate a particular ploidy to use for export; see other export functions, and check the documentation for Structure and SPAGeDi to see if variable ploidy is allowed.

The function will need to call GetProbableGenotypes with omit1allelePerLocus = FALSE and multiallelic = "correct". This will get you a list, the first item of which is a matrix with individuals and rows and alleles in columns, showing the copy number of that allele in that individual. This will then need to be converted to a different format, where each allele is assigned an integer, and the individual has a number of alleles up to its ploidy. For example:

Allele_AG Allele_AA Allele_CA
Ind 1 0 3 1
Ind 2 2 2 0

Gets changed to

Allele 1 Allele 2 Allele 3 Allele 4
Ind 1 2 2 2 3
Ind 2 1 1 2 2

Your function will make use of object$alleles2loc in order to group columns of the matrix output by GetProbableGenotypes into loci. A loop in R to process loci might look like:

geno <- GetProbableGenotypes(object, omit1allelePerLocus = FALSE)[[1]]

for(L in seq_len(nLoci(object)){
  theseal <- which(object$alleles2loc == L)
  thesegeno <- geno[, theseal]
  # do some processing here
}

See the MakeGTstrings internal function in src/PrepVCFexport.cpp. For SPAGeDi in particular this function could simply have an argument added to it to start numbering from 1 rather than 0. For Structure, MakeGTstrings should probably not be used directly but can provide some inspiration for what needs to be done.

See the Structure or SPAGeDi documentation for details on file format needed. Your function should export a text file in that format.

Test the function on exampleRAD after running IterateHWE. Install Structure or SPAGeDi on your computer and see if it can import the file.

Documentation and integration

  • Update man/ExportGAPIT.Rd to document your function.
  • Add the function name to the export call in NAMESPACE
  • Update the "Formats supported" section of README.md
  • Update the "Summary of available functions" section of vignettes/polyRADtutorial.Rmd.
  • Run vignettes/render_vignettes.R , with vignettes as the working directory, to recompile the vignette.
  • Add yourself as a contributor to the DESCRIPTION file! Use the "ctb" role code in Authors@R and Author.
  • Run R CMD build and R CMD check to make sure the package builds and passes checks.

lvclark avatar Jul 20 '20 19:07 lvclark