gentropy
feat: logic and airflow pipeline for validation
Included:
- [x] Study validation step + config
- [x] StudyLocus validation step + config
- [x] Airflow DAG to run validation steps
Not included:
For the sake of simplicity, these steps were not added to the gentropy ETL DAG.
QC
Included datasets:
# Input datasets:
STUDY_INDICES = [
"gs://gwas_catalog_data/study_index",
"gs://eqtl_catalogue_data/study_index",
"gs://finngen_data/r10/study_index",
]
STUDY_LOCI = [
"gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_PICSed_curated_associations",
"gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_PICSed_summary_statistics",
"gs://eqtl_catalogue_data/credible_set_datasets/susie",
"gs://finngen_data/r10/credible_set_datasets/finngen_susie_processed",
]
TARGET_INDEX = "gs://genetics_etl_python_playground/releases/24.06/gene_index"
DISEASE_INDEX = "gs://open-targets-pre-data-releases/24.06/output/etl/parquet/diseases"
# Output datasets:
VALIDATED_STUDY = "gs://ot-team/dsuveges/otg-data/validated_study_index"
VALIDATED_STUDY_LOCI = "gs://ot-team/dsuveges/otg-data/validated_credible_set"
QC-flagged studies by project ID:
+---------------------+----------------------------------------------------+-----+
|projectId |qc |count|
+---------------------+----------------------------------------------------+-----+
|Alasoo_2018 |Target/gene identifier could not match to reference.|30 |
|Aygun_2021 |Target/gene identifier could not match to reference.|1 |
|BLUEPRINT |Target/gene identifier could not match to reference.|82 |
|Bossini-Castillo_2019|Target/gene identifier could not match to reference.|25 |
|BrainSeq |Target/gene identifier could not match to reference.|53 |
|Braineac2 |Target/gene identifier could not match to reference.|5 |
|CAP |Target/gene identifier could not match to reference.|21 |
|CEDAR |Target/gene identifier could not match to reference.|17 |
|CommonMind |Target/gene identifier could not match to reference.|50 |
|Cytoimmgen |Target/gene identifier could not match to reference.|39 |
|FINNGEN_R10 |No valid disease identifier found. |2408 |
|FUSION |Target/gene identifier could not match to reference.|69 |
|Fairfax_2012 |Target/gene identifier could not match to reference.|3 |
|Fairfax_2014 |Target/gene identifier could not match to reference.|15 |
|GCST |Failed summary statistics quality control |472 |
|GCST |Non-additive model |32 |
|GCST |No valid disease identifier found. |7146 |
|GCST |The identifier of this study is not unique. |22 |
|GENCORD |Target/gene identifier could not match to reference.|25 |
|GEUVADIS |Target/gene identifier could not match to reference.|22 |
|GTEx |Target/gene identifier could not match to reference.|1317 |
|Gilchrist_2021 |Target/gene identifier could not match to reference.|5 |
|HipSci |Target/gene identifier could not match to reference.|40 |
|Kasela_2017 |Target/gene identifier could not match to reference.|4 |
|Lepik_2017 |Target/gene identifier could not match to reference.|26 |
|Naranbhai_2015 |Target/gene identifier could not match to reference.|1 |
|Nathan_2022 |Target/gene identifier could not match to reference.|38 |
|Nedelec_2016 |Target/gene identifier could not match to reference.|40 |
|OneK1K |Target/gene identifier could not match to reference.|21 |
|PISA |Target/gene identifier could not match to reference.|8 |
|Peng_2018 |Target/gene identifier could not match to reference.|2 |
|Perez_2022 |Target/gene identifier could not match to reference.|7 |
|PhLiPS |Target/gene identifier could not match to reference.|11 |
|Quach_2016 |Target/gene identifier could not match to reference.|104 |
|ROSMAP |Target/gene identifier could not match to reference.|56 |
|Randolph_2021 |Target/gene identifier could not match to reference.|1 |
|Schmiedel_2018 |Target/gene identifier could not match to reference.|119 |
|Schwartzentruber_2018|Target/gene identifier could not match to reference.|6 |
|Steinberg_2020 |Target/gene identifier could not match to reference.|15 |
|TwinsUK |Target/gene identifier could not match to reference.|83 |
|Walker_2019 |Target/gene identifier could not match to reference.|15 |
|iPSCORE |Target/gene identifier could not match to reference.|7 |
|van_de_Bunt_2015 |Target/gene identifier could not match to reference.|2 |
+---------------------+----------------------------------------------------+-----+
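The table above is a count of flags per `(projectId, qc)` pair. A minimal sketch of that aggregation in plain Python follows. The field names `projectId` and `qualityControls`, the study IDs, and the sample records are assumptions for illustration; the actual summary was most likely computed with Spark on the validated study index.

```python
from collections import Counter

GENE_FLAG = "Target/gene identifier could not match to reference."
DISEASE_FLAG = "No valid disease identifier found."

# Hypothetical study records; a study may carry zero, one, or several flags.
studies = [
    {"studyId": "s1", "projectId": "GTEx", "qualityControls": [GENE_FLAG]},
    {"studyId": "s2", "projectId": "GTEx", "qualityControls": [GENE_FLAG]},
    {"studyId": "s3", "projectId": "GCST", "qualityControls": [DISEASE_FLAG]},
    {"studyId": "s4", "projectId": "GCST", "qualityControls": []},
]


def count_flags(records):
    """Count QC flags per (projectId, qc) pair, one increment per flag."""
    counts = Counter()
    for record in records:
        for flag in record.get("qualityControls") or []:
            counts[(record["projectId"], flag)] += 1
    return counts


flag_counts = count_flags(studies)
for (project_id, qc), n in sorted(flag_counts.items()):
    print(f"{project_id}\t{qc}\t{n}")
```

Because one increment is made per flag, a study with several flags contributes to several rows, so the row counts need not sum to the number of flagged studies.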
QC-flagged GWAS Catalog credible sets:
+---------+----------------------------------------------------------------+------+
|projectId|qc |count |
+---------+----------------------------------------------------------------+------+
|GCST |Variant inconsistency |81691 |
|GCST |Subsignificant p-value |165196|
|GCST |Composite association |1956 |
|GCST |LD block does not contain variants at the required R^2 threshold|88166 |
|GCST |Variant not found in LD reference |124082|
|GCST |Palindrome alleles - cannot harmonize |66183 |
|GCST |Explained by a more significant variant in high LD (clumped) |66510 |
|GCST |Study has failed quality controls |24518 |
|GCST |No mapping in GnomAd |82932 |
|GCST |Incomplete genomic mapping |81684 |
|GCST |Non-unique study locus identifier |146074|
+---------+----------------------------------------------------------------+------+
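One plausible reading of the `Non-unique study locus identifier` flag is that it marks every credible set whose identifier occurs more than once in the dataset. The sketch below illustrates that reading only; the field names `studyLocusId` and `qualityControls` and the sample rows are assumptions, not the validation step's actual code.

```python
from collections import Counter

NON_UNIQUE = "Non-unique study locus identifier"


def flag_duplicated_ids(credible_sets):
    """Return copies of the records, flagging those sharing a studyLocusId."""
    id_counts = Counter(cs["studyLocusId"] for cs in credible_sets)
    flagged = []
    for cs in credible_sets:
        flags = list(cs.get("qualityControls") or [])
        if id_counts[cs["studyLocusId"]] > 1 and NON_UNIQUE not in flags:
            flags.append(NON_UNIQUE)
        flagged.append({**cs, "qualityControls": flags})
    return flagged


# Hypothetical credible sets: "a" appears twice, "b" once.
example = [
    {"studyLocusId": "a", "qualityControls": []},
    {"studyLocusId": "a", "qualityControls": ["Subsignificant p-value"]},
    {"studyLocusId": "b", "qualityControls": []},
]
result = flag_duplicated_ids(example)
```

Under this reading every member of a duplicated group is flagged, not only the later occurrences, which would be consistent with the large count reported for this flag.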
QC-flagged credible sets with study-related issues:
+---------------------+---------------------------------+------+
|projectId |qc |count |
+---------------------+---------------------------------+------+
|GCST |Non-unique study locus identifier|146074|
|Alasoo_2018 |Study has failed quality controls|35 |
|Aygun_2021 |Study has failed quality controls|1 |
|BLUEPRINT |Study has failed quality controls|107 |
|Bossini-Castillo_2019|Study has failed quality controls|27 |
|BrainSeq |Study has failed quality controls|57 |
|Braineac2 |Study has failed quality controls|7 |
|CAP |Study has failed quality controls|21 |
|CEDAR |Study has failed quality controls|18 |
|CommonMind |Study has failed quality controls|55 |
|Cytoimmgen |Study has failed quality controls|57 |
|FINNGEN_R10 |Study has failed quality controls|13966 |
|FUSION |Study has failed quality controls|77 |
|Fairfax_2012 |Study has failed quality controls|3 |
|Fairfax_2014 |Study has failed quality controls|16 |
|GCST |Study has failed quality controls|24518 |
|GENCORD |Study has failed quality controls|26 |
|GEUVADIS |Study has failed quality controls|27 |
|GTEx |Study has failed quality controls|1498 |
|Gilchrist_2021 |Study has failed quality controls|7 |
|HipSci |Study has failed quality controls|44 |
|Kasela_2017 |Study has failed quality controls|4 |
|Lepik_2017 |Study has failed quality controls|35 |
|Naranbhai_2015 |Study has failed quality controls|1 |
|Nathan_2022 |Study has failed quality controls|39 |
|Nedelec_2016 |Study has failed quality controls|46 |
|OneK1K |Study has failed quality controls|37 |
|PISA |Study has failed quality controls|9 |
|Peng_2018 |Study has failed quality controls|2 |
|Perez_2022 |Study has failed quality controls|7 |
|PhLiPS |Study has failed quality controls|11 |
|Quach_2016 |Study has failed quality controls|132 |
|ROSMAP |Study has failed quality controls|68 |
|Randolph_2021 |Study has failed quality controls|2 |
|Schmiedel_2018 |Study has failed quality controls|136 |
|Schwartzentruber_2018|Study has failed quality controls|6 |
|Steinberg_2020 |Study has failed quality controls|17 |
|TwinsUK |Study has failed quality controls|102 |
|Walker_2019 |Study has failed quality controls|17 |
|iPSCORE |Study has failed quality controls|8 |
|van_de_Bunt_2015 |Study has failed quality controls|2 |
+---------------------+---------------------------------+------+
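The table above suggests that study-level failures are carried over to the credible sets of the affected studies. A minimal sketch of such a propagation, assuming a credible set is flagged whenever its study has at least one QC flag; the field names, the `study_flags` lookup, and the sample data are hypothetical:

```python
STUDY_FAILED = "Study has failed quality controls"


def flag_by_study(credible_sets, study_flags):
    """Add STUDY_FAILED to credible sets whose study carries any QC flag.

    study_flags maps studyId to the list of QC flags found for that study.
    """
    flagged = []
    for cs in credible_sets:
        flags = list(cs.get("qualityControls") or [])
        if study_flags.get(cs["studyId"]) and STUDY_FAILED not in flags:
            flags.append(STUDY_FAILED)
        flagged.append({**cs, "qualityControls": flags})
    return flagged


# Hypothetical inputs: study "s1" failed validation, "s2" passed.
study_flags = {
    "s1": ["No valid disease identifier found."],
    "s2": [],
}
credible_sets = [
    {"studyLocusId": "x", "studyId": "s1", "qualityControls": []},
    {"studyLocusId": "y", "studyId": "s2", "qualityControls": []},
]
validated = flag_by_study(credible_sets, study_flags)
```

A single flagged study can own many credible sets, which would explain why the per-project counts here are at least as large as the corresponding study counts in the first table.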