sgn solGS: add option to use genotype data from multiple genotyping protocols

-- Genotyping protocols have overlapping markers. Filtering for the markers shared among protocols allows accessions genotyped with different protocols to be included in analysis pipelines. (request from Marnin)
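
As a rough illustration of the filtering step (not solGS code), the shared marker set can be taken as the intersection of each protocol's markers. The sketch below assumes a marker is identified by chromosome, position, ref allele and alt allele; the protocol names, the marker data and the `shared_markers` helper are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumption: a marker is identified by (chromosome, position, ref, alt).
Marker = Tuple[str, int, str, str]


def shared_markers(protocols: Dict[str, Set[Marker]]) -> Set[Marker]:
    """Return the markers that are present in every protocol."""
    if not protocols:
        return set()
    return set.intersection(*protocols.values())


# Hypothetical protocols with made-up markers.
protocols = {
    "protocol_A": {("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 40, "G", "A")},
    "protocol_B": {("1", 100, "A", "G"), ("2", 40, "G", "A"), ("2", 90, "T", "C")},
}

common = shared_markers(protocols)
print(sorted(common))
```

Including the alleles in the marker key means a site is only treated as shared when both protocols describe the same variant, not just the same coordinate.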

Nov 29 '21 10:11 isaak

This can be dangerous, because the same coordinate may refer to a different genomic region in a different protocol.

Nov 29 '21 16:11 lukasmueller

Dangerous, I agree. I do this only with care and knowledge. However, here's an example of why:

There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring. In order to make the prediction, the training population phenos+genos and the offspring genos need to be joined. If I guess correctly, that workflow is not currently possible without uploading a "genotyping protocol" that merges the two?

There may be alternative solutions for this, but the standard workflow in the future will generate these disjoint VCF files.

Nov 29 '21 16:11 wolfemd

Yeah I am struggling with the same issue. My solution is to impute everything to the same genotyping protocol. It's a bit of a pain, but you would need to impute everything to the same marker set before running predictions anyway, right?

Nov 30 '21 01:11 ch728

Dangerous, I agree. I do this only with care and knowledge.

Would you elaborate on the care and knowledge you apply?

Nov 30 '21 09:11 isaak

There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring. In order to make the prediction, the training population phenos+genos and the offspring genos need to be joined. If I guess correctly, that workflow is not currently possible without uploading a "genotyping protocol" that merges the two?

created a ticket for this: #3858

Nov 30 '21 09:11 isaak

There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring.

How large is the overlap of clones between successive genotyping protocols? I thought the clones in "West Africa 2020" were also in "NRCRI DarT-GBS 2021".

Would Chris's approach solve this issue in the future?

Nov 30 '21 09:11 isaak

Yeah I am struggling with the same issue. My solution is to impute everything to the same genotyping protocol. It's a bit of a pain, but you would need to impute everything to the same marker set before running predictions anyway, right?

Alternatively, you can merge the VCFs you've imputed separately and then upload them. But it's just going to generate VCFs that get successively larger as subsequent selection cycles get genotyped, imputed and then need to be merged and uploaded. Perhaps they could be periodically overwritten with newer more inclusive files?
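
A minimal sketch of what such a merge amounts to, assuming each separately imputed batch has already been read into a mapping from accession to marker dosages (this is not a VCF parser, and `merge_genotypes`, the accession names and the marker IDs are hypothetical). The merged table keeps only markers called in every accession.

```python
from typing import Dict, List

# Assumption: accession -> {marker_id: dosage of the alt allele (0, 1 or 2)}.
Genotypes = Dict[str, Dict[str, int]]


def merge_genotypes(batches: List[Genotypes]) -> Genotypes:
    """Combine batches and keep only markers present in every accession."""
    merged: Genotypes = {}
    for batch in batches:
        for accession, calls in batch.items():
            if accession in merged:
                # Design choice for this sketch: refuse to silently overwrite.
                raise ValueError(f"accession {accession!r} is in more than one batch")
            merged[accession] = dict(calls)
    if not merged:
        return merged
    shared = set.intersection(*(set(calls) for calls in merged.values()))
    return {
        accession: {marker: calls[marker] for marker in sorted(shared)}
        for accession, calls in merged.items()
    }


# Made-up example: a training batch and a later offspring batch.
training = {"TP_001": {"S1_100": 0, "S1_250": 1, "S2_40": 2}}
offspring = {"C3_001": {"S1_100": 1, "S2_40": 0}}

merged = merge_genotypes([training, offspring])
print(merged)
```

Raising on a repeated accession is only one option; if the same clones are genotyped under successive protocols, a real merge would need an explicit rule for which call to keep.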

Nov 30 '21 11:11 wolfemd

Dangerous, I agree. I do this only with care and knowledge.

Would you elaborate on the care and knowledge you apply?

Well, my whole imputation pipeline is set up to generate compatible files. In the NRCRI example that I gave, the imputation reference panel was imputed with Beagle and contains a combination of DArT and GBS-genotyped sites. The offspring were genotyped later using DArT, and the ref. panel was used to impute them. The imputed progeny tend to have a subset of the sites in the reference panel, because I do some post-imputation QC.

Equally important is that the reference genome, chromosome + position, and Ref/Alt alleles match between datasets.
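
To make that check concrete, here is a small sketch, assuming each dataset has been reduced to a reference genome label plus a mapping from (chromosome, position) to (ref, alt). The function names, genome labels and sites are hypothetical, not part of any existing pipeline.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]      # (chromosome, position)
Alleles = Tuple[str, str]   # (ref, alt)


def find_conflicts(a: Dict[Site, Alleles], b: Dict[Site, Alleles]) -> List[Site]:
    """Return sites present in both datasets whose ref/alt alleles differ."""
    return sorted(site for site in a.keys() & b.keys() if a[site] != b[site])


def compatible(genome_a: str, genome_b: str,
               a: Dict[Site, Alleles], b: Dict[Site, Alleles]) -> bool:
    """True if both datasets use the same reference genome and agree on alleles."""
    return genome_a == genome_b and not find_conflicts(a, b)


# Made-up example: one shared site has swapped ref/alt alleles.
panel = {("1", 100): ("A", "G"), ("2", 40): ("G", "A")}
progeny = {("1", 100): ("A", "G"), ("2", 40): ("A", "G"), ("3", 7): ("C", "T")}

conflicts = find_conflicts(panel, progeny)
print(conflicts)
print(compatible("genome_v6", "genome_v6", panel, progeny))
```

A check like this only catches allele disagreements at matching coordinates. It cannot detect the case where both datasets agree on alleles but the coordinates come from different genome builds, which is why the reference genome label is compared separately.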

Hopefully that clarifies a bit?

Nov 30 '21 11:11 wolfemd
