goci
Investigate efficiency of harmonisation pipeline for WGS studies
We would like to verify whether the efficiency of the harmonisation (hm) pipeline differs significantly for sequencing-based GWAS.
- Calculate the average % of dropped and unharmonisable variants among a representative sample of array-based summary statistics.
- Calculate the average % of dropped and unharmonisable variants among a representative sample of sequencing-based summary statistics.
If possible, it could be useful to analyse the GWAS-SSF and pre-GWAS-SSF formats separately.
Genome-wide sequencing:
No. | GWAS_id | Techniques | harmonised | Raw_rows | Harmonised_rows | hm_14 | hm_15 | hm_16 | Drop_ratio | hm_15(%) |
---|---|---|---|---|---|---|---|---|---|---|
1 | GCST90010173 | Genome-wide sequencing | yes | 24181159 | 18290576 | 0 | 0 | 0 | 24.36% | 0.00% |
2 | GCST90093113 | Genome-wide sequencing | yes | 7173861 | 7164907 | 0 | 47998 | 0 | 0.12% | 0.67% |
3 | GCST90001390 | Genome-wide sequencing | yes | 7843596 | 7654311 | 0 | 45969 | 2 | 2.41% | 0.59% |
4 | GCST90014052 | Genome-wide sequencing | yes | 5056041 | 5056029 | 0 | 11168 | 2 | 0.00% | 0.22% |
5 | GCST90161593 | Genome-wide sequencing | yes | 10004360 | 9450643 | 0 | 0 | 0 | 5.53% | 0.00% |
vs. Genome-wide genotyping array:
PMID | GCST_id | Techniques | harmonised | Raw_rows | Harmonised_rows | hm_14 | hm_15 | hm_16 | Drop_ratio | hm_15(%) |
---|---|---|---|---|---|---|---|---|---|---|
33589840 | GCST90012878 | Genome-wide genotyping array | yes | 25643629 | 25367157 | 0 | 292056 | 67 | 1.08% | 1.14% |
28887542 | GCST005069 | Genome-wide genotyping array | yes | 25290284 | 25186082 | 19 | 179244 | 2 | 0.41% | 0.71% |
33782385 | GCST012278 | Genome-wide genotyping array | yes | 7216416 | 7180648 | 1 | 35317 | 0 | 0.50% | 0.49% |
33143745 | GCST90093334 | Genome-wide genotyping array | yes | 8034880 | 7982170 | 0 | 131524 | 18 | 0.66% | 1.64% |
30053915 | GCST006353 | Genome-wide genotyping array | yes | 5694112 | 5692296 | 7 | 24756 | 0 | 0.03% | 0.43% |
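The percentages in the two tables above can be recomputed from the row counts. Below is a minimal Python sketch, assuming both `Drop_ratio` and `hm_15(%)` are taken relative to `Raw_rows` (this reproduces the tabulated values, but the formula is not stated in this ticket). The function and variable names are illustrative and are not part of the pipeline.

```python
from statistics import mean, median

# (GCST id, Raw_rows, Harmonised_rows, hm_15) copied from the tables above.
WGS = [
    ("GCST90010173", 24181159, 18290576, 0),
    ("GCST90093113", 7173861, 7164907, 47998),
    ("GCST90001390", 7843596, 7654311, 45969),
    ("GCST90014052", 5056041, 5056029, 11168),
    ("GCST90161593", 10004360, 9450643, 0),
]
ARRAY = [
    ("GCST90012878", 25643629, 25367157, 292056),
    ("GCST005069", 25290284, 25186082, 179244),
    ("GCST012278", 7216416, 7180648, 35317),
    ("GCST90093334", 8034880, 7982170, 131524),
    ("GCST006353", 5694112, 5692296, 24756),
]


def drop_pct(raw_rows, harmonised_rows):
    """Percentage of input rows missing from the harmonised file."""
    return 100 * (raw_rows - harmonised_rows) / raw_rows


def hm15_pct(raw_rows, hm_15):
    """Percentage of input rows with code hm_15 (assumed denominator: Raw_rows)."""
    return 100 * hm_15 / raw_rows


def summarise(studies):
    """Mean, median and maximum drop percentage for a group of studies."""
    drops = [drop_pct(raw, harmonised) for _, raw, harmonised, _ in studies]
    return {"mean": mean(drops), "median": median(drops), "max": max(drops)}


for label, studies in (("WGS", WGS), ("array", ARRAY)):
    stats = summarise(studies)
    print(label, {key: round(value, 2) for key, value in stats.items()})
```

With only five studies per group, the mean is dominated by a single outlier (GCST90010173), which is why the median and maximum are reported alongside it.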
Next to do:
- GCST90010173 and GCST90161593: investigate why so many variants are dropped from the harmonised file (24.36% and 5.53%, respectively).
- Run harmonisation against the new Ensembl version as well.
- Prioritise harmonising data for 33937362, 35381062, 36124557, 36206743, 36327219, 36349687.
- Reason why variants are dropped
  - GCST90010173: ~10.5% of variants have identical reference and alternative alleles, and ~14% of variants have no matching VCF record.
  - GCST90161593: ~5% of variants have no matching VCF record.
- Run harmonisation against the new Ensembl version as well.
GWAS_id | V_95 (2018) | V_105 (2021) | V_111 (2023) | Total variants | 2% of variants |
---|---|---|---|---|---|
GCST90010173 | 75.64% | 75.76% | 78.39% | 24181159 | 483623.18 |
GCST90179391 | 79.04% | 74.38% | 77.75% | 30566328 | 611326.56 |
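To make the table above easier to compare, the change in harmonisation rate between releases can be expressed in percentage points and in approximate variant counts. The Python sketch below uses only the tabulated values. The names are illustrative, and because the purpose of the 2% column is not stated here, the code simply reproduces it as 2% of `Total variants`.

```python
# Harmonisation rate (%) per Ensembl release and total variant count,
# copied from the table above.
RATES = {
    "GCST90010173": {"V_95": 75.64, "V_105": 75.76, "V_111": 78.39, "total": 24181159},
    "GCST90179391": {"V_95": 79.04, "V_105": 74.38, "V_111": 77.75, "total": 30566328},
}


def rate_change(study, old="V_95", new="V_111"):
    """Return (change in percentage points, approximate change in variant count)."""
    row = RATES[study]
    delta_pp = round(row[new] - row[old], 2)
    delta_variants = round(row["total"] * delta_pp / 100)
    return delta_pp, delta_variants


def two_percent(study):
    """2% of the study's total variants, as in the last table column."""
    return round(RATES[study]["total"] * 2 / 100, 2)


for study in RATES:
    pp, n = rate_change(study)
    print(f"{study}: V_95 -> V_111 {pp:+.2f} pp (~{n:+d} variants), 2% = {two_percent(study)}")
```

The two studies move in opposite directions between V_95 and V_111, which the signed percentage-point change makes explicit.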
Variants that can be harmonised with V_95 but not with V_111 fall into two cases:
- Some indels: the variant representations in V_95 and V_111 are shifted by one base. The matching VCF records cannot be retrieved correctly from V_111, and the indel representation there differs from that in the input file.
- Some SNPs: multiple records are retrieved from V_111, and our pipeline cannot tell which one is correct.
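As an illustration of the indel case above, the sketch below shows one way two records that differ by a one-base shift can describe the same change. It assumes the shift comes from a padding (anchor) base being present in one representation and absent in the other; the ticket does not confirm that this is the cause. The function names and example coordinates are hypothetical, and this is not the pipeline's matching logic.

```python
def trim_shared_bases(pos, ref, alt):
    """Strip bases shared by ref and alt: suffix first, then prefix.

    Each trimmed prefix base advances the position by one, so a padded
    record and its unpadded equivalent reduce to the same tuple.
    """
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def same_variant(a, b):
    """Compare two (pos, ref, alt) records after trimming shared bases."""
    return trim_shared_bases(*a) == trim_shared_bases(*b)


# Hypothetical example: deletion of one "A", written with and without
# the preceding anchor base "C".
padded = (100, "CA", "C")
unpadded = (101, "A", "")

print(same_variant(padded, unpadded))        # True
print(same_variant(padded, (102, "A", "")))  # False
```

Trimming alone does not resolve shifts inside repeats, where the same indel can be placed at several positions; that would require left-aligning against the reference sequence.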
@ljwh2 Can we close this ticket? After our investigation:
- Only 5 whole-genome sequencing datasets were available.
- The harmonisation rate among these five studies varies widely, from 75% to 100%.
- By comparison, the harmonisation rate for genome-wide genotyping array data ranged from 98.92% to 99.97%.
- We cannot conclude that the harmonisation rate is significantly lower for WGS.
We also investigated whether the updated reference VCF file improves the harmonisation rate for WGS data. Of the 8 studies tested, the rate increased for 3 and decreased for 5. The new reference VCF therefore does not necessarily improve the harmonisation rate.
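One way to check that a 3-versus-5 split is compatible with no systematic effect is an exact sign test. The Python sketch below is an illustration added here, not an analysis performed in this ticket. It assumes the 8 studies are independent and ignores the size of each change; the function name is illustrative.

```python
from math import comb


def sign_test_p(successes, n):
    """Two-sided exact binomial p-value under H0: increase and decrease equally likely."""
    lower = sum(comb(n, k) for k in range(0, successes + 1)) / 2**n
    upper = sum(comb(n, k) for k in range(successes, n + 1)) / 2**n
    return min(1.0, 2 * min(lower, upper))


# 3 of the 8 tested studies had a higher rate with the new reference VCF.
p = sign_test_p(3, 8)
print(f"p = {p:.3f}")  # p = 0.727
```

A p-value this large is consistent with the statement above: these 8 studies give no indication that the new reference shifts the rate in either direction.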
Our collaborator mentioned that they cannot use the variants that cannot be harmonised.