
feat(airflow): prototype finemapping batch job


New Airflow DAG that calls the SuSiE fine-mapping step on a list of studyLocus IDs to generate credible sets.

✨ Context

This is an incremental pipeline: the first three tasks generate the list of studyLocus IDs to fine-map. Once the difference between all clumped loci and already fine-mapped loci is computed, we fine-map those IDs as a Google Batch job.

[Image: graph view of the fine-mapping DAG]

Graph nodes:

  • get_all_study_locus_ids. Lists the bucket containing the clumped study loci and extracts, from the filenames, all IDs that could theoretically be fine-mapped. This step requires that clumped study loci are written partitioned by their ID, which is not currently implemented (e.g. gs://genetics-portal-dev-analysis/irene/toy_studdy_locus_alzheimer_partitioned).
  • get_finemapped_paths. Lists the bucket containing all credible sets produced by fine-mapping and extracts the IDs from the filenames. Similarly, the SLID must be part of the filename. This is already partially implemented: the data is not partitioned, but the ID is used to build the output path.
  • get_study_loci_to_finemap. Creates a list of IDs from the difference between get_all_study_locus_ids and get_finemapped_paths.
  • finemapping_task. Interfaces with the Google Batch operator to create a single Batch job that runs the Docker container with as many parallel tasks as there are studyLocus IDs extracted in get_study_loci_to_finemap. The command run in the container image calls the fine-mapping step with the appropriate parameters. (See the sketch after this list.)

🛠 What does this PR implement

  • The DAG explained above.
  • A small change to the logic of SusieFineMapperStep: instead of passing the Row of the studyLocus to fine-map, it now passes a DataFrame containing that single row. This prevents incompatibilities between the input data and the schema (after partitioning the data by SLID and reading it back, that column was appended at the end of the schema). See the sketch below.
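
Illustratively, the change amounts to something like the following; the variable names, input path, and step parameter are hypothetical, not the exact gentropy signature.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
study_locus_df = spark.read.parquet(
    "gs://<bucket>/study_locus_clumped"  # hypothetical partitioned input
)
slid = "1234567890"  # hypothetical studyLocusId

# Before: a single Row was passed. After reading data partitioned by
# studyLocusId, that column comes back appended at the end of the schema,
# which breaks positional access against the expected schema.
# row = study_locus_df.filter(f.col("studyLocusId") == slid).collect()[0]

# After: pass a one-row DataFrame, letting Spark resolve columns by name
# so the partition column's position no longer matters.
one_row_df = study_locus_df.filter(f.col("studyLocusId") == slid)
```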

🙈 Missing

  • Testing that the logic works after fixing the schema issues. This requires creating a new image.
  • Fine-tuning the resources allocated in the Batch job. This overlaps with the work done by @tskir.
  • Sorting out the input data for this DAG:
    • First, we need to partition the data prior to the fine-mapping step, as required by the get_all_study_locus_ids node. This involves changing ld_based_clumping.py (see the sketch after this list).
    • Then, we have to be mindful that we want to fine-map studyLoci from two sources: UKBB PPP and GWAS Catalog. So we either run the DAG twice or write all clumped study loci under the same location.
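
A minimal sketch of the write change, assuming the clumped dataset is written as Parquet; the helper name and output path are placeholders, not the actual ld_based_clumping.py code.

```python
from pyspark.sql import DataFrame


def write_clumped_study_locus(clumped_df: DataFrame, path: str, mode: str = "overwrite") -> None:
    """Write clumped study loci partitioned by their ID.

    This produces .../studyLocusId=<id>/part-*.parquet objects, which is the
    layout get_all_study_locus_ids relies on to recover every ID.
    """
    clumped_df.write.partitionBy("studyLocusId").mode(mode).parquet(path)


# Hypothetical usage: writing both sources under one prefix would let a single
# DAG run cover UKBB PPP and GWAS Catalog alike.
# write_clumped_study_locus(ukbb_ppp_df, "gs://<bucket>/study_locus_clumped")
# write_clumped_study_locus(gwas_catalog_df, "gs://<bucket>/study_locus_clumped", mode="append")
```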

🚦 Before submitting

  • [ ] Do these changes cover one single feature (one change at a time)?
  • [X] Did you read the contributor guideline?
  • [ ] Did you make sure to update the documentation with your changes?
  • [X] Did you make sure there is no commented out code in this PR?
  • [X] Did you follow conventional commits standards in PR title and commit messages?
  • [X] Did you make sure the branch is up-to-date with the dev branch?
  • [ ] Did you write any new necessary tests?
  • [X] Did you make sure the changes pass local tests (make test)?
  • [X] Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

ireneisdoomed, Apr 23 '24 14:04