pbwt
Some rsIDs in original VCF not in imputed VCF
Looking only at chromosome 3, about 2k rsIDs (from 23andMe data) are not found among the 2.8G SNPs in the imputed chromosome 3 output. Why would SNPs that are present in the input be dropped from the output?
Many thanks.
See:
https://bioinformatics.stackexchange.com/questions/18728/confusing-result-from-sanger-imputation-service-eagle-v2-4-for-phasing-pbwt-v3
We can only impute at sites shared with the reference panel. The 2k rsIDs you refer to must not be in the reference panel you are using; you can easily check that by comparing the lists of sites. I believe pbwt also reports how many sites are being used.
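The site-list comparison suggested above could be sketched roughly as follows. This is only an illustration, not part of pbwt: the function names `read_sites` and `sites_missing_from_panel` are hypothetical, and it assumes both files are uncompressed VCF text, use the same genome build, and name chromosomes the same way (e.g. `3` rather than `chr3`).

```python
def read_sites(vcf_lines):
    """Return the set of (chrom, pos) pairs found in VCF-formatted text lines."""
    sites = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Only the first two whitespace-separated fields (CHROM, POS) are needed.
        chrom, pos = line.split(None, 2)[:2]
        sites.add((chrom, int(pos)))
    return sites


def sites_missing_from_panel(input_lines, panel_lines):
    """Return input sites that have no matching (chrom, pos) in the panel."""
    return sorted(read_sites(input_lines) - read_sites(panel_lines))


if __name__ == "__main__":
    # Tiny made-up records, for illustration only.
    array_vcf = ["#CHROM\tPOS\tID", "3\t100\trs1\tA\tG", "3\t200\trs2\tC\tT"]
    panel_vcf = ["3\t100\t.\tA\tG"]
    print(sites_missing_from_panel(array_vcf, panel_vcf))
```

Comparing on chromosome and position rather than on rsID avoids false mismatches when the two files carry different ID annotations.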
Thanks for this info.
This was the explanation I had guessed. However, the SNPs missing from the imputation result are in my data and don't need to be imputed, so it's strange that they don't find their way into the result, don't you think?
These are the 'anchors' that the other data is derived from, so dropping any of them is a problem (I'm guessing).
Obviously I can merge the files, but that's a bit of a pain.
Interestingly, I'm processing the 23andMe (v3) file along with the imputation results, and I find that there are 1,806 rsIDs that can be added back to the imputation results by matching chromosome and position. i.e. they are in the imputation panel after all, but they don't have the correct rsID, or the rsID was somehow dropped at one stage or another.
So of the 2,479 'missing' SNPs, 1,806 can be 'found' leaving 673 'anchors' missing from chromosome 3.
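A minimal sketch of this kind of position-based matching is shown below. It is not the script linked later in the thread; the function names are hypothetical. It assumes the 23andMe raw file is whitespace-separated with columns rsid, chromosome, position, genotype and `#` comment lines, that the imputed VCF is plain text, and that both use the same build and chromosome naming.

```python
def index_23andme(lines):
    """Map (chrom, pos) -> rsID for each record in a 23andMe raw-data file."""
    by_pos = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, pos = line.split()[:3]
        by_pos[(chrom, int(pos))] = rsid
    return by_pos


def recover_by_position(array_lines, imputed_lines):
    """Split array rsIDs absent from the imputed VCF into two sorted lists:
    those whose (chrom, pos) is present anyway, and those truly missing."""
    imputed_ids, imputed_pos = set(), set()
    for line in imputed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, vid = line.split(None, 3)[:3]
        imputed_ids.add(vid)
        imputed_pos.add((chrom, int(pos)))

    found_by_position, still_missing = [], []
    for key, rsid in index_23andme(array_lines).items():
        if rsid in imputed_ids:
            continue  # matched by rsID already, nothing to recover
        if key in imputed_pos:
            found_by_position.append(rsid)
        else:
            still_missing.append(rsid)
    return sorted(found_by_position), sorted(still_missing)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    array = ["# rsid\tchromosome\tposition\tgenotype",
             "rs1\t3\t100\tAG", "rs2\t3\t200\tCC", "rs3\t3\t300\tTT"]
    imputed = ["3\t100\trs1\tA\tG", "3\t200\t.\tC\tT"]
    print(recover_by_position(array, imputed))
```

A position match alone does not confirm that the alleles agree, so a stricter version would also compare the genotype against REF and ALT.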
On a related note, I see some rsIDs with multiple positions in the results (this is different from the variations with > 1 alt allele we discussed elsewhere). e.g.
3 4942430 rs71634747 G A . PASS RefPanelAF=0.445673;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0
3 4942432 rs71634747 C G . PASS RefPanelAF=0.248907;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
When I check that rsID, I can see that it was merged with two other rsIDs at these locations: https://www.ncbi.nlm.nih.gov/snp/rs71634747 https://www.ncbi.nlm.nih.gov/snp/rs724135 https://www.ncbi.nlm.nih.gov/snp/rs3762784
Which sort of makes sense, sort of not... from somewhere the two 'new' locations for this rsID have been found (4942430 and 4942432), but the new rsIDs have not. i.e. it's the same old rsID with the new locations.
I find about 30 of these cases in chromosome 3.
Sorry if this is overly pedantic... I'm honestly not sure how else to work!
Many thanks, Dan.
So I just checked and all but 2 of the 31 rsIDs in the imputation results for chromosome 3 have been merged into two separate rsIDs, and all the distances between them are less than 10bp. Although this isn't a big problem it's still confusing regarding new position / old rsID.
I guess this may be a bug / version mix up somewhere?
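For reference, rsIDs that appear at more than one position can be listed with something like the sketch below. The function name is hypothetical and the code assumes plain-text VCF records; it splits on any whitespace so it works on the records as pasted in this thread.

```python
from collections import defaultdict


def rsids_with_multiple_positions(vcf_lines):
    """Return {rsID: sorted list of (chrom, pos)} for IDs seen at more than one site."""
    positions = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, vid = line.split(None, 3)[:3]
        if vid == ".":
            continue  # records without an ID cannot be duplicated by ID
        positions[vid].add((chrom, int(pos)))
    return {vid: sorted(locs) for vid, locs in positions.items() if len(locs) > 1}


if __name__ == "__main__":
    records = [
        "3 4942430 rs71634747 G A . PASS RefPanelAF=0.445673;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0",
        "3 4942432 rs71634747 C G . PASS RefPanelAF=0.248907;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
        "3 103279 rs555415488 G A . PASS RefPanelAF=0.000323375;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
    ]
    for rsid, locs in rsids_with_multiple_positions(records).items():
        span = locs[-1][1] - locs[0][1]  # distance in bp, meaningful on one chromosome
        print(rsid, locs, span)
```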
The two rsIDs with two positions and no apparent cause in dbSNP are:
- https://www.ncbi.nlm.nih.gov/snp/rs71637536 and
- https://www.ncbi.nlm.nih.gov/snp/rs71632245
Here is how they appear in the file:
3 89056114 rs71632245 C T . PASS RefPanelAF=0.519495;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
3 89056116 rs71632245 A G . PASS RefPanelAF=0.543132;AN=2;AC=2;INFO=1 GT:ADS:DS:GP 1|1:1,1:2:0,0,1
I assume this is a version inconsistency with the data somewhere.
More possibly related 'weirdness':
3 103279 rs555415488 G A . PASS RefPanelAF=0.000323375;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
3 103279 . G T . PASS RefPanelAF=0.000107792;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
And just for completeness...
3 50253604 rs750257636 C A . PASS RefPanelAF=7.69941e-05;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
3 50253604 rs587737274 C G . PASS RefPanelAF=0.000323375;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
--
3 178156578 rs10212245 T G . PASS RefPanelAF=0.00203265;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
3 178156578 rs200836430 T C . PASS RefPanelAF=0.117062;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0
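The cases above, where one position carries several records with different IDs or ALT alleles, can be collected with a sketch along these lines. The function name is hypothetical and the code assumes plain-text VCF records with the standard leading fields CHROM, POS, ID, REF, ALT.

```python
from collections import defaultdict


def positions_with_multiple_records(vcf_lines):
    """Return {(chrom, pos): [(id, ref, alt), ...]} for sites with more than one record."""
    records = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.split(None, 5)[:5]
        records[(chrom, int(pos))].append((vid, ref, alt))
    return {site: recs for site, recs in records.items() if len(recs) > 1}


if __name__ == "__main__":
    lines = [
        "3 103279 rs555415488 G A . PASS RefPanelAF=0.000323375;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
        "3 103279 . G T . PASS RefPanelAF=0.000107792;AN=2;AC=0;INFO=1 GT:ADS:DS:GP 0|0:0,0:0:1,0,0",
        "3 4942430 rs71634747 G A . PASS RefPanelAF=0.445673;AN=2;AC=1;INFO=1 GT:ADS:DS:GP 1|0:1,0:1:0,1,0",
    ]
    for site, recs in positions_with_multiple_records(lines).items():
        print(site, recs)
```

Records that share a position but differ in ALT are consistent with a multi-allelic site written as one record per allele, which is a separate situation from one rsID appearing at two positions.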
@richarddurbin Sorry, I know the above details are a pain to 'parse', but hopefully the problems are clear enough. Please let me know if any of the above issues are unclear. Here is the Python code that I used to pull everything out: https://github.com/Geromics/covcheck/blob/wip/report-v2/Research/debug_imputation_results.py
Please let me know if I should log these problems elsewhere.
I have some questions about phasing / imputation in general. Could you or one of your colleagues spare some time to talk me through some details?
Many thanks, Dan.