UKB sumstats wrangling & ingest

Open ljwh2 opened this issue 11 months ago • 69 comments

Data associated with this project (https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1) has been shared with Open Targets and needs ingesting into the GWAS Catalog. The data is split into separate files per chromosome, with roughly 35M variants per GWAS.

  • [x] Liaise with David & Daniel at Open Targets to get access to the data (currently in their cloud storage)
  • [x] Combine chromosome-specific files into a single file per GWAS (study)
    • [x] Quant
    • [x] Binary - restarted merge for corrupt studies
  • [x] Reformat to GWAS-SSF
    • [x] Quant - find at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats_quant/
    • [x] Binary - find at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats/
  • [x] Check file integrity of formatted files - find them at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/
  • [ ] Clean up the rows with TEST_FAIL in the EXTRA column - the SLURM jobs are done, but there are errors, e.g. files with some chromosomes missing:
    • [ ] Investigate files with some chromosomes missing
  • [x] Verify that the validation pipeline can cope with files of this size
    • [x] Quant - sample file passed the validation
  • [ ] Wrangle the metadata template (see email attachment) to match the files, i.e. copy the files to the private ftp and compare their md5sum values against those of the files in aws/formatted_long/
    • [ ] Quant - the Quant file copy completed and the md5sum values in the private ftp were compared with those in aws/formatted/gwas_summary_stats_quant, but this needs to be redone because the files have since been formatted again
    • [ ] Binary - Copying files to the private ftp
  • [ ] Create submission on behalf of the author for immediate release (not under embargo)
    • [ ] Quant - same error a second time; Yue will help investigate it: https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1267#issuecomment-2188608204
    • [ ] Binary
  • [ ] Before queuing for harmonisation, harmonise one file and check variant dropout rate. Discuss results before proceeding.
    • [ ] Quant
    • [ ] Binary
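
The TEST_FAIL clean-up step in the checklist above could be sketched roughly as follows. This is a minimal illustration, not the script used for this ticket: it assumes the files are gzipped, tab-separated, and have a single header line that contains a column named EXTRA. The function name `drop_test_fail_rows` and its parameters are hypothetical.

```python
import gzip


def drop_test_fail_rows(src_path, dst_path, column="EXTRA", marker="TEST_FAIL"):
    """Copy a gzipped TSV, skipping rows whose `column` contains `marker`.

    Assumes one tab-separated header line (an assumption, not confirmed
    by the ticket). Raises ValueError if the header has no such column.
    Returns (rows_kept, rows_dropped).
    """
    kept = dropped = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        header = src.readline()
        idx = header.rstrip("\n").split("\t").index(column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Short rows cannot carry the marker, so they are kept as-is.
            if idx < len(fields) and marker in fields[idx]:
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    return kept, dropped
```

Comparing the returned counts with the original row count gives a quick sanity check on how many variants the clean-up removes from each file.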

@earlEBI can provide support in interpreting the template and especially with the template wrangling and submission steps.

ljwh2 avatar Mar 12 '24 11:03 ljwh2

Started copying files to /hps/nobackup

      65202370  standard goci1267  spotbot  R    1:13:10      1 hl-codon-111-03

karatugo avatar Mar 27 '24 09:03 karatugo

Copying complete. Now comparing MD5 checksums of the copied files to those listed in md5sums.txt.
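
For illustration, a comparison like this could look roughly like the sketch below. It assumes md5sums.txt uses the usual `md5sum` output layout (checksum, whitespace, relative path) and that the paths are relative to the directory holding the copied files. The function names and the keys of the returned report are hypothetical. The report also lists files that have no manifest entry, and manifest entries with no file on disk.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _normalise(name):
    """Strip the binary-mode '*' prefix and leading './' from a manifest path."""
    name = name.strip()
    if name.startswith("*"):
        name = name[1:]
    while name.startswith("./"):
        name = name[2:]
    return name


def read_manifest(manifest_path):
    """Parse 'checksum  path' lines into {relative_path: checksum}."""
    expected = {}
    with open(manifest_path) as fh:
        for line in fh:
            parts = line.split(maxsplit=1)
            if len(parts) == 2:
                expected[_normalise(parts[1])] = parts[0].lower()
    return expected


def compare_with_manifest(root, manifest_path):
    """Compare every file under `root` with the manifest, in both directions."""
    root = Path(root)
    manifest_path = Path(manifest_path).resolve()
    expected = read_manifest(manifest_path)
    listed = set(expected)
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.resolve() != manifest_path
    }
    return {
        "not_in_manifest": sorted(on_disk - listed),
        "not_on_disk": sorted(listed - on_disk),
        "mismatched": [
            name for name in sorted(on_disk & listed)
            if md5_of(root / name) != expected[name]
        ],
    }
```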

karatugo avatar Mar 28 '24 14:03 karatugo

Submitted a SLURM job to calculate and compare md5sums of files in GCP and Codon.

      66126819  standard md5sum-c  spotbot  R      01:26      1 hl-codon-09-03

karatugo avatar Apr 03 '24 14:04 karatugo

I calculated and compared the MD5 sums of files in GCP and Codon. They matched.

karatugo avatar Apr 04 '24 07:04 karatugo

Submitted two SLURM jobs to combine chromosome-specific files into a single file per GWAS (study):

         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      66784780 datamover    merge  spotbot  R       2:20      1 codon-dm-05
        162292 datamover merge-re  spotbot  R       0:24      1 codon-dm-05

Note that all .txt.gz files are in gwas_summary_stats and all .regenie.gz files are in gwas_summary_stats_quant.
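
As a rough sketch of what such a merge involves (not the actual merge script), the snippet below concatenates the per-chromosome files of one study, writing the header once. It assumes each file is gzipped text with one identical header line and that file names start with `chr<label>_`. The chromosome ordering, the function names, and the header check are all assumptions.

```python
import gzip
import re
from pathlib import Path

# Assumed ordering: autosomes 1-22, then X, Y, MT.
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def chrom_key(path):
    """Sort key that orders files by the chromosome label in their name."""
    match = re.match(r"chr([0-9A-Za-z]+)_", Path(path).name)
    label = match.group(1).upper() if match else ""
    rank = CHROM_ORDER.index(label) if label in CHROM_ORDER else len(CHROM_ORDER)
    return (rank, label)


def merge_chromosome_files(paths, out_path):
    """Concatenate per-chromosome files into one gzipped file.

    The header is taken from the first file and written once; a file with a
    different header raises ValueError. Returns the number of data rows written.
    """
    header = None
    rows = 0
    with gzip.open(out_path, "wt") as out:
        for path in sorted(paths, key=chrom_key):
            with gzip.open(path, "rt") as src:
                first = src.readline()
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    raise ValueError(f"header mismatch in {path}")
                for line in src:
                    out.write(line)
                    rows += 1
    return rows
```

Reading each input to the end also exercises the gzip CRC check, so a corrupt per-chromosome file raises an error instead of being merged silently.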

karatugo avatar Apr 08 '24 16:04 karatugo

I've noticed that MD5 sums are missing for some files, for example under ./gwas_summary_stats/j92/. Additionally, there are warnings that these files are corrupt when I try to combine them.

I did not detect this issue earlier because the affected files lack entries in md5sums.txt. I have reached out to Annalisa about this.

karatugo avatar Apr 15 '24 15:04 karatugo

I have access to S3 now.

karatugo avatar May 13 '24 16:05 karatugo

Submitted two SLURM jobs for the data copy to codon, namely, cp_ukb_aws_gwas_summary_stats and cp_ukb_aws_gwas_summary_stats_quant using the sbatch scripts /homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats.sh and /homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats_quant.sh

karatugo avatar May 17 '24 15:05 karatugo

Every file in our directory matches exactly with the files listed in md5sums.txt, and vice versa. See the script at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/compare_md5sums.sh

karatugo avatar May 21 '24 16:05 karatugo

Submitted batch job 16272725 to compute and compare md5sums of the copied files.

karatugo avatar May 21 '24 16:05 karatugo

Finished computing and comparing md5sums of the copied files. The values matched.

karatugo avatar May 22 '24 10:05 karatugo

Submitted batch job 16375825 to back up the copied files.

karatugo avatar May 22 '24 10:05 karatugo

Backup of the copied files is complete.

karatugo avatar May 23 '24 15:05 karatugo

Submitted batch jobs 16648847 and 16648861 to combine chromosome-specific files into a single file per GWAS (study).

karatugo avatar May 23 '24 15:05 karatugo

16648861 - merge-regenie complete.

karatugo avatar May 28 '24 08:05 karatugo

For regenie studies:

No files found for blood_biochemistry_oest_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_oest_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry AFR. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry SAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry EAS. Skipping...

karatugo avatar May 31 '24 14:05 karatugo

Submitted a SLURM job that runs gwas-ssf format on the regenie files. Expect the output here in about 2 days: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted/gwas_summary_stats_quant/

karatugo avatar May 31 '24 15:05 karatugo

16648847: merge_gwas_summary_stats Ended

karatugo avatar Jun 04 '24 10:06 karatugo

Attached is the skipped list for gwas_summary_stats

gwas_sumstats_skipped_list.txt

karatugo avatar Jun 04 '24 10:06 karatugo

Unfortunately, the disk quota for spot/gwas/scratch/ was exceeded and some of the merge operations failed. I'll ask the Codon team how best to handle this.

karatugo avatar Jun 04 '24 11:06 karatugo

Talked to Codon team.

  • [ ] I'll move the copied files and backup files to lts (and talk to the Storage team if we don't already have space)
  • [x] Restart the merge job (ideally with parallel compression)

karatugo avatar Jun 04 '24 11:06 karatugo

Tested the validation with the file blood_biochemistry_ua_0_EAS_combined_formatted.regenie.gz; it worked fine.

karatugo avatar Jun 11 '24 09:06 karatugo

Merge job for gwas_summary_stats resumed.

Submitted batch job 22796315

karatugo avatar Jun 17 '24 13:06 karatugo

The SLURM job that gathers data for the metadata template (e.g. md5sums of the files, variant counts) is done.
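
A minimal sketch of the per-file numbers such a job might collect is shown below. It assumes the variant count is the number of lines after a single header line, and the dictionary keys are hypothetical rather than actual metadata template column names.

```python
import gzip
import hashlib
from pathlib import Path


def summarise_file(path, chunk_size=1 << 20):
    """Return the file name, the MD5 of the compressed file, and its row count."""
    path = Path(path)

    # First pass: checksum of the file exactly as stored (compressed bytes).
    digest = hashlib.md5()
    with open(path, "rb") as raw:
        for chunk in iter(lambda: raw.read(chunk_size), b""):
            digest.update(chunk)

    # Second pass: count data rows, skipping the assumed single header line.
    with gzip.open(path, "rt") as text:
        next(text, None)
        variants = sum(1 for _ in text)

    return {"file": path.name, "md5": digest.hexdigest(), "variants": variants}
```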

karatugo avatar Jun 19 '24 09:06 karatugo

Submitted a SLURM job for copying Quant files to the private ftp for test submission in sandbox.

      23899637 datamover cp-ukbb-quant-private-ftp                               spotbot  R  2:57:34    1      codon-dm-04

karatugo avatar Jun 21 '24 10:06 karatugo

Submission template is now ready. Lizzy updated it and fixed the errors.

karatugo avatar Jun 21 '24 10:06 karatugo

Possible errors found during the merge of Binary studies:

gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--length error

The md5sum values of the files matched the values in the list, so the files were probably already corrupt at the source.
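
A matching MD5 only shows that the copy is byte-identical to whatever was checksummed at the source; it says nothing about whether the gzip stream itself is valid. An integrity check similar in spirit to `gzip -t` can be sketched in Python as below. The function name and return shape are hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress a gzip file to the end and report whether that succeeded.

    Returns (True, "") for a readable file, or (False, reason) when
    decompression fails, e.g. on a CRC or length mismatch, a truncated
    stream, or corrupt deflate data. A missing or unreadable file is also
    reported as (False, reason) because it raises OSError.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # The trailer (CRC32 and length) is only verified once the whole
            # member has been read, so read until EOF and discard the data.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""
```

Running a check like this on every per-chromosome file before merging would surface corrupt inputs up front, even when they lack an entry in md5sums.txt.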

karatugo avatar Jun 21 '24 11:06 karatugo

Submitted format jobs in SLURM using the gwas-ssf format command for the binary studies.

      23949330 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-130-01
      23949331 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-bm-10
      23949332 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-bm-10
      23949333 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-bm-10

karatugo avatar Jun 21 '24 13:06 karatugo

Job codon-slurm.23972158: compare-md5sum-quant-private-ftp Began

karatugo avatar Jun 21 '24 15:06 karatugo

Job codon-slurm.23972158: compare-md5sum-quant-private-ftp complete. No issues with md5sum values for Quant files in the private ftp and aws/formatted/gwas_summary_stats_quant.

karatugo avatar Jun 24 '24 12:06 karatugo