solGS: add option to use genotype data from multiple genotyping protocols
Genotyping protocols have overlapping markers. Filtering for the markers shared among protocols allows accessions genotyped with different protocols to be included in the same analysis pipelines. (Request from Marnin.)
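As a rough illustration of the shared-marker filtering described above, here is a minimal Python sketch. It is not solGS code: the in-memory layout, the protocol and accession names, and the functions `shared_markers` and `combine_on_shared` are all hypothetical. A marker is identified by chromosome, position, REF and ALT, so a coordinate with different alleles in two protocols is not treated as shared.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: marker key -> {accession -> alt-allele dosage}.
# A marker key is (chromosome, position, REF allele, ALT allele).
MarkerKey = Tuple[str, int, str, str]
Genotypes = Dict[MarkerKey, Dict[str, int]]


def shared_markers(protocols: Dict[str, Genotypes]) -> List[MarkerKey]:
    """Return the sorted marker keys that are present in every protocol."""
    key_sets = [set(genotypes) for genotypes in protocols.values()]
    if not key_sets:
        return []
    return sorted(set.intersection(*key_sets))


def combine_on_shared(protocols: Dict[str, Genotypes]) -> Genotypes:
    """Pool accessions from all protocols, keeping only the shared markers.

    If an accession appears in more than one protocol, the value from the
    protocol listed last wins (an arbitrary choice made for this sketch).
    """
    combined: Genotypes = {}
    for key in shared_markers(protocols):
        row: Dict[str, int] = {}
        for genotypes in protocols.values():
            row.update(genotypes[key])
        combined[key] = row
    return combined


# Made-up example data: two protocols that overlap at one marker only.
protocols = {
    "protocol_A": {
        ("1", 100, "A", "G"): {"acc1": 0, "acc2": 1},
        ("1", 200, "C", "T"): {"acc1": 2, "acc2": 0},
    },
    "protocol_B": {
        ("1", 100, "A", "G"): {"acc3": 1},
        ("1", 300, "G", "A"): {"acc3": 2},
    },
}
combined = combine_on_shared(protocols)
```

In this example only the marker at chromosome 1, position 100 survives, and it carries genotypes for all three accessions.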
This can be dangerous, because the same coordinate may refer to a different genomic region in a different protocol.
Dangerous, I agree. I do this only with care and knowledge. However, here's an example of why:
There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring. In order to make the prediction, the training population phenos+genos and the offspring genos need to be joined. If I guess correctly, that workflow is not currently possible without uploading a "genotyping protocol" that merges the two?
There may be alternative solutions for this, but the standard workflow in the future will generate these disjoint VCF files.
Yeah I am struggling with the same issue. My solution is to impute everything to the same genotyping protocol. It's a bit of a pain, but you would need to impute everything to the same marker set before running predictions anyway, right?
> Dangerous, I agree. I do this only with care and knowledge.
Would you elaborate on the care and knowledge you apply?
> There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring. In order to make the prediction, the training population phenos+genos and the offspring genos need to be joined. If I guess correctly, that workflow is not currently possible without uploading a "genotyping protocol" that merges the two?
Created a ticket for this: #3858
> There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring.
How much overlap of clones is there between successive genotyping protocols? I thought the same clones in "West Africa 2020" were also in "NRCRI DarT-GBS 2021".
Would Chris's approach solve this issue in the future?
> Yeah I am struggling with the same issue. My solution is to impute everything to the same genotyping protocol. It's a bit of a pain, but you would need to impute everything to the same marker set before running predictions anyway, right?
Alternatively, you can merge the VCFs you've imputed separately and then upload the merged file. But that will just generate VCFs that get successively larger, because each subsequent selection cycle has to be genotyped, imputed, merged and uploaded. Perhaps the files could be periodically overwritten with newer, more inclusive ones?
> Dangerous, I agree. I do this only with care and knowledge.

> Would you elaborate on the care and knowledge you apply?
Well, my whole imputation pipeline is set up to generate compatible files. In the NRCRI example that I gave, the imputation reference panel was imputed with Beagle and contains a combination of DArT- and GBS-genotyped sites. The offspring were genotyped later using DArT, and the reference panel was used to impute them. The progeny that get imputed tend to have a subset of the sites in the reference panel, because I do some post-imputation QC.
Equally important is that the reference genome, chromosome and position, and REF/ALT alleles match between datasets.
Hopefully that clarifies a bit?
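The chromosome, position and REF/ALT matching mentioned above can be checked mechanically before datasets are combined. The following is only a sketch under stated assumptions, not the pipeline described in this thread: `read_sites` and `compare_sites` are hypothetical helpers, the input is plain uncompressed VCF text with the standard leading columns (CHROM, POS, ID, REF, ALT), and the example records are made up. It cannot confirm that both files were called against the same reference genome build; that has to be checked from the VCF headers or other metadata.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]     # (chromosome, position)
Alleles = Tuple[str, str]  # (REF, ALT)


def read_sites(vcf_lines: Iterable[str]) -> Dict[Site, Alleles]:
    """Collect (chrom, pos) -> (REF, ALT) from the data lines of a VCF."""
    sites: Dict[Site, Alleles] = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites[(chrom, int(pos))] = (ref, alt)
    return sites


def compare_sites(
    panel: Dict[Site, Alleles], target: Dict[Site, Alleles]
) -> Dict[str, List[Site]]:
    """Classify every target site against the reference panel."""
    report: Dict[str, List[Site]] = {
        "match": [],               # same REF and ALT
        "swapped": [],             # REF and ALT exchanged
        "mismatch": [],            # same coordinate, different alleles
        "missing_from_panel": [],  # coordinate absent from the panel
    }
    for site, (ref, alt) in sorted(target.items()):
        if site not in panel:
            report["missing_from_panel"].append(site)
        elif panel[site] == (ref, alt):
            report["match"].append(site)
        elif panel[site] == (alt, ref):
            report["swapped"].append(site)
        else:
            report["mismatch"].append(site)
    return report


# Made-up example records.
panel_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "1\t200\t.\tC\tT",
    "1\t300\t.\tG\tA",
]
target_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "1\t200\t.\tT\tC",
    "1\t300\t.\tG\tC",
    "1\t400\t.\tT\tA",
]
report = compare_sites(read_sites(panel_vcf), read_sites(target_vcf))
```

Only the sites reported under `match` are safe to join directly. Sites under `swapped` would need their dosages flipped first, and those under `mismatch` most likely point to a different variant or a different reference.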