goci
UKB sumstats wrangling & ingest
Data associated with this project https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1 has been shared with Open Targets and needs ingesting into the GWAS Catalog. The data is presented in separate files for each chromosome, looks like ~35M variants per GWAS.
- [x] Liaise with David & Daniel at Open Targets to get access to the data (currently in their cloud storage)
- [x] Combine chromosome-specific files into a single file per GWAS (study)
- [x] Quant
- [x] Binary - restarted merge for corrupt studies
- [x] Reformat to GWAS-SSF
- [x] Quant - find at `/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats_quant/`
- [x] Binary - find at `/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats/`
- [x] Check file integrity of formatted files - find them at `/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/`
- [ ] Clean up the rows with `TEST_FAIL` in the EXTRA column - SLURM jobs done but there are errors, e.g. files with some chromosomes missing:
- [ ] Investigate files with some chromosomes missing
- [x] Verify that the validation pipeline can cope with files of this size
- [x] Quant - sample file passed the validation
- [ ] Wrangle the metadata template (see email attachment) to match the files, i.e., copy files to private ftp and compare md5sum values of the ones in private ftp and `aws/formatted_long/`
- [ ] Quant - Quant file copy complete and compared the md5sum values of Quant files in the private ftp and `aws/formatted/gwas_summary_stats_quant`; this was done but needs to be restarted because the files were formatted again
- [ ] Binary - Copying files to the private ftp
- [ ] Create submission on behalf of the author for immediate release (not under embargo)
- [ ] Quant - Same error 2nd time, Yue will help with the error investigation https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1267#issuecomment-2188608204
- [ ] Binary
- [ ] Before queuing for harmonisation, harmonise one file and check variant dropout rate. Discuss results before proceeding.
- [ ] Quant
- [ ] Binary
@earlEBI can provide support in interpreting the template and especially with the template wrangling and submission steps.
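For the open item on files with some chromosomes missing, a check along these lines could list which chromosomes never appear in a formatted file. This is only a sketch: it assumes the formatted files are tab-separated with a `chromosome` column (as in GWAS-SSF) and that autosomes 1 to 22 are the expected set; `missing_chromosomes` is a hypothetical helper, not part of any existing script.

```python
import gzip

# Assumption: only autosomes are expected; extend the list if X is required.
EXPECTED = [str(n) for n in range(1, 23)]


def missing_chromosomes(path, column="chromosome", expected=EXPECTED):
    """Return the expected chromosome labels that never appear in the file."""
    seen = set()
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        index = header.index(column)
        for line in handle:
            # Split only as far as the chromosome column to keep the scan cheap.
            seen.add(line.rstrip("\n").split("\t", index + 1)[index])
    return [chrom for chrom in expected if chrom not in seen]
```

Running it over every file in `formatted_long/` and recording the non-empty results would give a list of studies to investigate.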
Started copying files to /hps/nobackup
65202370 standard goci1267 spotbot R 1:13:10 1 hl-codon-111-03
Copying complete. Now comparing MD5 checksums of the copied files to those listed in md5sums.txt.
Submitted a SLURM job to calculate and compare md5sums of files in GCP and Codon.
66126819 standard md5sum-c spotbot R 01:26 1 hl-codon-09-03
I calculated and compared the MD5 sums of files in GCP and Codon. They matched.
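For reference, the comparison is roughly equivalent to the sketch below. It is not the SLURM job that was run; it assumes `md5sums.txt` uses the usual `md5sum` layout (`<hash>  <relative path>` per line), and the helper names are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(manifest_path):
    """Parse md5sum-style lines ("<hash>  <relative path>") into {path: hash}."""
    expected = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the file name.
        expected[name.lstrip("*").strip()] = checksum.lower()
    return expected


def verify(root, manifest_path):
    """Return (path, reason) pairs for listed files that are missing or mismatched."""
    problems = []
    for name, checksum in parse_manifest(manifest_path).items():
        target = Path(root) / name
        if not target.is_file():
            problems.append((name, "missing"))
        elif md5_of_file(target) != checksum:
            problems.append((name, "mismatch"))
    return problems
```

An empty list from `verify` means every file listed in the manifest is present and matches. Files that have no manifest entry are not checked at all.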
Submitted two SLURM jobs to combine chromosome-specific files into a single file per GWAS (study):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
66784780 datamover merge spotbot R 2:20 1 codon-dm-05
162292 datamover merge-re spotbot R 0:24 1 codon-dm-05
Note that all `.txt.gz` files are in `gwas_summary_stats` and all `.regenie.gz` files are in `gwas_summary_stats_quant`.
I've noticed that some file MD5 sums are missing, for example for ./gwas_summary_stats/j92/. Additionally, there are warnings indicating that these files are corrupt when I try to combine them.
I did not detect this issue earlier because the affected files lack entries in md5sums.txt. I have reached out to Annalisa about this.
I have access to S3 now.
Submitted two SLURM jobs for the data copy to Codon, namely `cp_ukb_aws_gwas_summary_stats` and `cp_ukb_aws_gwas_summary_stats_quant`, using the sbatch scripts `/homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats.sh` and `/homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats_quant.sh`.
Every file in our directory matches exactly with the files listed in md5sums.txt, and vice versa. See the script at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/compare_md5sums.sh
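The two-way check matters because files without a manifest entry (as happened with `j92` earlier) are invisible to a plain checksum comparison. The sketch below illustrates the idea in Python; it is not the contents of `compare_md5sums.sh`, and the function names and the assumption that manifest paths are relative to the data root are mine.

```python
from pathlib import Path


def manifest_names(manifest_path):
    """Collect the normalised relative paths listed in an md5sum-style manifest."""
    names = set()
    for line in Path(manifest_path).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # Path() drops a leading "./", so "./a/b" and "a/b" compare equal.
            names.add(Path(parts[1].lstrip("*").strip()).as_posix())
    return names


def files_on_disk(root, patterns=("*.txt.gz", "*.regenie.gz")):
    """Collect relative paths of summary statistics files found under root."""
    root = Path(root)
    found = set()
    for pattern in patterns:
        for path in root.rglob(pattern):
            found.add(path.relative_to(root).as_posix())
    return found


def compare(root, manifest_path):
    """Return (files on disk but not listed, files listed but not on disk)."""
    listed = manifest_names(manifest_path)
    present = files_on_disk(root)
    return sorted(present - listed), sorted(listed - present)
```

Both returned lists being empty corresponds to the "matches exactly, and vice versa" result reported above.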
Submitted batch job 16272725 to compute and compare md5sums of the copied files.
Finished computing and comparing md5sums of the copied files. The values matched.
Submitted batch job 16375825 to back up the copied files.
Backup of the copied files complete.
Submitted batch jobs 16648847 and 16648861 to combine chromosome-specific files into a single file per GWAS (study).
16648861 - merge-regenie complete.
For regenie studies:
No files found for blood_biochemistry_oest_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_oest_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry AFR. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry SAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry EAS. Skipping...
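To make the "Skipping..." messages above easier to interpret, here is a rough Python sketch of the merge logic: gather the per-chromosome files for one study and ancestry, write them out in chromosome order with a single header, and skip the combination if nothing is found. It is not the actual merge script. The file name pattern (`chr<N>_..._<ancestry><suffix>`) is inferred from the binary file names and is an assumption for the regenie files; `merge_study` and `chromosome_files` are hypothetical names.

```python
import gzip
import re
from pathlib import Path

CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y"]


def chromosome_files(study_dir, ancestry, suffix):
    """Map chromosome label -> file for names like chr9_<study>_<ancestry><suffix>."""
    pattern = re.compile(
        r"^chr([0-9]+|X|Y)_.*_" + re.escape(ancestry) + re.escape(suffix) + "$"
    )
    found = {}
    for path in Path(study_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found[match.group(1)] = path
    return found


def merge_study(study_dir, ancestry, suffix, out_path):
    """Concatenate per-chromosome files in chromosome order, keeping one header.

    Returns the chromosomes written, or [] if no input files were found.
    """
    found = chromosome_files(study_dir, ancestry, suffix)
    if not found:
        print(f"No files found for {Path(study_dir).name} with ancestry {ancestry}. Skipping...")
        return []
    written = []
    with gzip.open(out_path, "wt") as out:
        for chrom in CHROM_ORDER:
            if chrom not in found:
                continue
            with gzip.open(found[chrom], "rt") as src:
                header = src.readline()
                if not written:
                    out.write(header)  # header only from the first file
                for line in src:
                    out.write(line)
            written.append(chrom)
    return written
```

Comparing the returned list against the expected chromosomes gives an early warning when a study is merged from an incomplete set of files.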
Submitted a `gwas-ssf format` SLURM job for formatting regenie files. Expect them here in 2 days: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted/gwas_summary_stats_quant/
16648847: merge_gwas_summary_stats Ended
Unfortunately the job hit `Disk quota exceeded` for `spot/gwas/scratch/` and some of the merge operations failed. I'll talk to the Codon team about how best to navigate this issue.
Talked to Codon team.
- [ ] I'll move copied files and backup files to `lts` (and talk to Storage team if we don't already have space)
- [x] Restart the merge job (ideally with parallel compression)
Tested the validation with the file `blood_biochemistry_ua_0_EAS_combined_formatted.regenie.gz`; it worked okay.
Merge job for `gwas_summary_stats` resumed.
Submitted batch job 22796315
Done - a SLURM job for gathering data (e.g. md5sums of the files, calculating variant counts etc.) for the metadata template.
Submitted a SLURM job for copying Quant files to the private ftp for test submission in sandbox.
23899637 datamover cp-ukbb-quant-private-ftp spotbot R 2:57:34 1 codon-dm-04
Submission template is now ready. Lizzy updated it and fixed the errors.
Possible errors found during the merge of Binary studies:
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--length error
The md5sum values of these files matched the values in the list, which suggests the files were possibly already corrupted at the source.
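A matching md5sum only shows that the copy is byte-identical to the source. The gzip trailer carries its own CRC-32 and uncompressed length, which is what the `crc error` and `length error` messages refer to, so a full decompression pass is needed to catch this kind of damage. A minimal sketch of such a check in Python, similar in spirit to `gzip -t` (the function name is hypothetical):

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Decompress the whole file and discard the output.

    Returns (True, "") on success, or (False, message) if the stream is
    truncated, has invalid compressed data, or fails the trailer CRC check.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as error:
        return False, str(error)
    return True, ""
```

Running this over the per-chromosome inputs before merging would flag files such as the ones listed above up front, rather than partway through a merge.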
Submitted format jobs in SLURM using the `gwas-ssf format` command for binary studies.
23949330 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-130-01
23949331 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-bm-10
23949332 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-bm-10
23949333 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-bm-10
Job codon-slurm.23972158: compare-md5sum-quant-private-ftp Began
Job codon-slurm.23972158: compare-md5sum-quant-private-ftp complete. No issues with md5sum values for Quant files in the private ftp and aws/formatted/gwas_summary_stats_quant.