
feat: logic and airflow pipeline for validation

Open DSuveges opened this issue 6 months ago • 0 comments

Included:

  • [x] Study validation step + config
  • [x] StudyLocus validation step + config
  • [x] Airflow DAG to run validation steps

Not included:

For simplicity, these validation steps were not added to the gentropy ETL DAG.

QC

Included datasets:

# Input datasets:
STUDY_INDICES = [
    "gs://gwas_catalog_data/study_index",
    "gs://eqtl_catalogue_data/study_index",
    "gs://finngen_data/r10/study_index",
]
STUDY_LOCI = [
    "gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_PICSed_curated_associations",
    "gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_PICSed_summary_statistics",
    "gs://eqtl_catalogue_data/credible_set_datasets/susie",
    "gs://finngen_data/r10/credible_set_datasets/finngen_susie_processed",
]
TARGET_INDEX = "gs://genetics_etl_python_playground/releases/24.06/gene_index"
DISEASE_INDEX = "gs://open-targets-pre-data-releases/24.06/output/etl/parquet/diseases"

# Output datasets:
VALIDATED_STUDY = "gs://ot-team/dsuveges/otg-data/validated_study_index"
VALIDATED_STUDY_LOCI = "gs://ot-team/dsuveges/otg-data/validated_credible_set"

QC-flagged studies by project ID:

+---------------------+----------------------------------------------------+-----+
|projectId            |qc                                                  |count|
+---------------------+----------------------------------------------------+-----+
|Alasoo_2018          |Target/gene identifier could not match to reference.|30   |
|Aygun_2021           |Target/gene identifier could not match to reference.|1    |
|BLUEPRINT            |Target/gene identifier could not match to reference.|82   |
|Bossini-Castillo_2019|Target/gene identifier could not match to reference.|25   |
|BrainSeq             |Target/gene identifier could not match to reference.|53   |
|Braineac2            |Target/gene identifier could not match to reference.|5    |
|CAP                  |Target/gene identifier could not match to reference.|21   |
|CEDAR                |Target/gene identifier could not match to reference.|17   |
|CommonMind           |Target/gene identifier could not match to reference.|50   |
|Cytoimmgen           |Target/gene identifier could not match to reference.|39   |
|FINNGEN_R10          |No valid disease identifier found.                  |2408 |
|FUSION               |Target/gene identifier could not match to reference.|69   |
|Fairfax_2012         |Target/gene identifier could not match to reference.|3    |
|Fairfax_2014         |Target/gene identifier could not match to reference.|15   |
|GCST                 |Failed summary statistics quality control           |472  |
|GCST                 |Non-additive model                                  |32   |
|GCST                 |No valid disease identifier found.                  |7146 |
|GCST                 |The identifier of this study is not unique.         |22   |
|GENCORD              |Target/gene identifier could not match to reference.|25   |
|GEUVADIS             |Target/gene identifier could not match to reference.|22   |
|GTEx                 |Target/gene identifier could not match to reference.|1317 |
|Gilchrist_2021       |Target/gene identifier could not match to reference.|5    |
|HipSci               |Target/gene identifier could not match to reference.|40   |
|Kasela_2017          |Target/gene identifier could not match to reference.|4    |
|Lepik_2017           |Target/gene identifier could not match to reference.|26   |
|Naranbhai_2015       |Target/gene identifier could not match to reference.|1    |
|Nathan_2022          |Target/gene identifier could not match to reference.|38   |
|Nedelec_2016         |Target/gene identifier could not match to reference.|40   |
|OneK1K               |Target/gene identifier could not match to reference.|21   |
|PISA                 |Target/gene identifier could not match to reference.|8    |
|Peng_2018            |Target/gene identifier could not match to reference.|2    |
|Perez_2022           |Target/gene identifier could not match to reference.|7    |
|PhLiPS               |Target/gene identifier could not match to reference.|11   |
|Quach_2016           |Target/gene identifier could not match to reference.|104  |
|ROSMAP               |Target/gene identifier could not match to reference.|56   |
|Randolph_2021        |Target/gene identifier could not match to reference.|1    |
|Schmiedel_2018       |Target/gene identifier could not match to reference.|119  |
|Schwartzentruber_2018|Target/gene identifier could not match to reference.|6    |
|Steinberg_2020       |Target/gene identifier could not match to reference.|15   |
|TwinsUK              |Target/gene identifier could not match to reference.|83   |
|Walker_2019          |Target/gene identifier could not match to reference.|15   |
|iPSCORE              |Target/gene identifier could not match to reference.|7    |
|van_de_Bunt_2015     |Target/gene identifier could not match to reference.|2    |
+---------------------+----------------------------------------------------+-----+
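
A table of this shape can be produced by expanding each study's list of QC flags and counting the (project, flag) pairs. The following is a minimal pure-Python sketch of that tabulation. The record layout (`studyId`, `projectId`, `qualityControls`) and the sample rows are assumptions made for illustration; this is not the gentropy implementation, which runs on Spark.

```python
from collections import Counter

# Hypothetical study records: each study carries a list of QC flag strings.
studies = [
    {"studyId": "S1", "projectId": "GCST", "qualityControls": ["Non-additive model"]},
    {
        "studyId": "S2",
        "projectId": "GCST",
        "qualityControls": ["Non-additive model", "No valid disease identifier found."],
    },
    {"studyId": "S3", "projectId": "GTEx", "qualityControls": []},
]


def count_flags(records):
    """Count (projectId, qc) pairs, one per flag attached to each record."""
    counts = Counter()
    for record in records:
        for qc in record["qualityControls"]:
            counts[(record["projectId"], qc)] += 1
    return counts


flag_counts = count_flags(studies)

# Print one row per (projectId, qc) pair, sorted for stable output.
for (project_id, qc), count in sorted(flag_counts.items()):
    print(f"{project_id}\t{qc}\t{count}")
```

A study with several flags contributes to several rows, so the counts in such a table do not necessarily sum to the number of flagged studies.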

QC-flagged GWAS Catalog credible sets:

+---------+----------------------------------------------------------------+------+
|projectId|qc                                                              |count |
+---------+----------------------------------------------------------------+------+
|GCST     |Variant inconsistency                                           |81691 |
|GCST     |Subsignificant p-value                                          |165196|
|GCST     |Composite association                                           |1956  |
|GCST     |LD block does not contain variants at the required R^2 threshold|88166 |
|GCST     |Variant not found in LD reference                               |124082|
|GCST     |Palindrome alleles - cannot harmonize                           |66183 |
|GCST     |Explained by a more significant variant in high LD (clumped)    |66510 |
|GCST     |Study has failed quality controls                               |24518 |
|GCST     |No mapping in GnomAd                                            |82932 |
|GCST     |Incomplete genomic mapping                                      |81684 |
|GCST     |Non-unique study locus identifier                               |146074|
+---------+----------------------------------------------------------------+------+

QC-flagged credible sets with study-related issues:

+---------------------+---------------------------------+------+
|projectId            |qc                               |count |
+---------------------+---------------------------------+------+
|GCST                 |Non-unique study locus identifier|146074|
|Alasoo_2018          |Study has failed quality controls|35    |
|Aygun_2021           |Study has failed quality controls|1     |
|BLUEPRINT            |Study has failed quality controls|107   |
|Bossini-Castillo_2019|Study has failed quality controls|27    |
|BrainSeq             |Study has failed quality controls|57    |
|Braineac2            |Study has failed quality controls|7     |
|CAP                  |Study has failed quality controls|21    |
|CEDAR                |Study has failed quality controls|18    |
|CommonMind           |Study has failed quality controls|55    |
|Cytoimmgen           |Study has failed quality controls|57    |
|FINNGEN_R10          |Study has failed quality controls|13966 |
|FUSION               |Study has failed quality controls|77    |
|Fairfax_2012         |Study has failed quality controls|3     |
|Fairfax_2014         |Study has failed quality controls|16    |
|GCST                 |Study has failed quality controls|24518 |
|GENCORD              |Study has failed quality controls|26    |
|GEUVADIS             |Study has failed quality controls|27    |
|GTEx                 |Study has failed quality controls|1498  |
|Gilchrist_2021       |Study has failed quality controls|7     |
|HipSci               |Study has failed quality controls|44    |
|Kasela_2017          |Study has failed quality controls|4     |
|Lepik_2017           |Study has failed quality controls|35    |
|Naranbhai_2015       |Study has failed quality controls|1     |
|Nathan_2022          |Study has failed quality controls|39    |
|Nedelec_2016         |Study has failed quality controls|46    |
|OneK1K               |Study has failed quality controls|37    |
|PISA                 |Study has failed quality controls|9     |
|Peng_2018            |Study has failed quality controls|2     |
|Perez_2022           |Study has failed quality controls|7     |
|PhLiPS               |Study has failed quality controls|11    |
|Quach_2016           |Study has failed quality controls|132   |
|ROSMAP               |Study has failed quality controls|68    |
|Randolph_2021        |Study has failed quality controls|2     |
|Schmiedel_2018       |Study has failed quality controls|136   |
|Schwartzentruber_2018|Study has failed quality controls|6     |
|Steinberg_2020       |Study has failed quality controls|17    |
|TwinsUK              |Study has failed quality controls|102   |
|Walker_2019          |Study has failed quality controls|17    |
|iPSCORE              |Study has failed quality controls|8     |
|van_de_Bunt_2015     |Study has failed quality controls|2     |
+---------------------+---------------------------------+------+
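
The "Study has failed quality controls" rows suggest a rule in which a credible set inherits a flag whenever its parent study carries at least one QC flag. The sketch below illustrates that propagation in plain Python. The rule itself, the field names, and the sample data are assumptions for illustration; the actual gentropy logic is not shown in this issue.

```python
STUDY_FAILED = "Study has failed quality controls"

# Hypothetical inputs: studies with QC flags, credible sets pointing at studies.
studies = [
    {"studyId": "S1", "qualityControls": ["No valid disease identifier found."]},
    {"studyId": "S2", "qualityControls": []},
]
credible_sets = [
    {"studyLocusId": "L1", "studyId": "S1", "qualityControls": []},
    {"studyLocusId": "L2", "studyId": "S2", "qualityControls": []},
    {"studyLocusId": "L3", "studyId": "S1", "qualityControls": ["Subsignificant p-value"]},
]


def flag_credible_sets(credible_sets, studies):
    """Return copies of the credible sets, flagged if their study has any QC flag."""
    failed_study_ids = {s["studyId"] for s in studies if s["qualityControls"]}
    flagged = []
    for cs in credible_sets:
        qc = list(cs["qualityControls"])  # copy so the input is not mutated
        if cs["studyId"] in failed_study_ids and STUDY_FAILED not in qc:
            qc.append(STUDY_FAILED)
        flagged.append({**cs, "qualityControls": qc})
    return flagged


validated = flag_credible_sets(credible_sets, studies)
for cs in validated:
    print(cs["studyLocusId"], cs["qualityControls"])
```

Under this assumed rule, a single flagged study can mark many credible sets, which would explain why the per-project counts here are at least as large as the corresponding study counts.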

DSuveges, Aug 16 '24 17:08