pgsc_calc
APPLY_SCORE Failing on Google Cloud Batch
Description of the bug
I am trying to run pgsc_calc on Google Cloud Batch to score chromosome files that I imputed from an ancestry.com report. Most of the pipeline is running successfully, but it fails on APPLY_SCORE:
Process `PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:PLINK2_SCORE (895a1bc3-77e6-4a28-858d-fc5d38c877e9 chromosome 12 effect type additive 0)` terminated with an error exit status (6)
The issue seems to be caused by the .psam files that the pipeline generates not being accessible:
INFO: Error: No samples in GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12.psam.
Fusion Info: fusion_version: 2.4.11-8ead802 clone_namespace: false kernel_version: 6.6 disk_cache_size: 368Gb max_open_files: 1048576
This is the executed command that causes the error:
INFO: plink2 --threads 2 --memory 8192 --seed 31 --extract 895a1bc3-77e6-4a28-858d-fc5d38c877e9_12_additive_0.scorefile.gz --allow-extra-chr --score 895a1bc3-77e6-4a28-858d-fc5d38c877e9_12_additive_0.scorefile.gz zs header-read cols=+scoresums,+denom,+fid list-variants no-mean-imputation --error-on-freq-calc --score-col-nums 3-6 --pfile vzs GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12 --out 895a1bc3-77e6-4a28-858d-fc5d38c877e9_12_additive_0
Some things I have tried:
- I thought the issue might be with fusion, so I tried disabling it and rerunning the pipeline. I was met with a similar error:
  Error: Failed to open GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12.psam : No such file or directory.
  However, in this case the command exit status was 3 instead of 6.
- Messing around with the format of my original .psam files. I tried these two formats, and neither seems to make a difference:
  Format 1:
  #IID SEX
  SAMPLE 1
  Format 2:
  #FID IID SEX
  895a1bc3-77e6-4a28-858d-fc5d38c877e9 SAMPLE 1
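Since the first error says the .psam has no samples, one quick check is to count the sample rows in the file that the task actually sees. This is only a sketch: `count_psam_samples` is a made-up helper name, and it assumes that every non-empty line not starting with `#` is a sample row.

```shell
# Count sample rows in a .psam file: every non-empty line that does not
# start with '#' (header and comment lines start with '#').
count_psam_samples() {
  awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical usage inside a task work directory:
#   count_psam_samples GRCh37_<sampleset>_12.psam
```

A result of 0 for a file that has rows locally would suggest the file was staged empty or truncated rather than formatted incorrectly.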
I understand pgsc_calc may run into issues with imputed chromosome files due to lack of WGS support, but I am able to successfully run the pipeline on the same chromosome files on a local Linux machine. So the issue seems to come from the cloud executor and not from my imputed chromosome files.
Any help would be appreciated, thanks.
Command used and terminal output
nextflow run pgscatalog/pgsc_calc \
-profile docker \
-c nextflow.config \
--input "$samplesheet_path" \
--target_build GRCh37 \
--pgs_id "$pgs_ids" \
-work-dir "$work_dir" \
--format json
Relevant files
System information
nextflow.config:
// Google Cloud Batch configuration for Nextflow
process {
    // Define the executor
    executor = 'google-batch'
    // Define the container image using an environment variable
    // Fallback to a generic gcloud image if not set
    container = System.getenv('CONTAINER_IMAGE') ?: 'gcr.io/google-containers/google-cloud-cli:latest'
    cpus = 7
    memory = '28.GB'
    time = '24.h'
    // Error strategy for potential preemptions (exit code 50001 for GCE Spot VM preemption via Batch)
    errorStrategy = { task.exitStatus == 50001 ? 'retry' : 'terminate' }
    maxRetries = 3
}
// Google Cloud specific settings
google {
    // Project ID and Location (Region) obtained from environment variables
    project = System.getenv('PROJECT_ID')
    location = System.getenv('GCP_REGION')
    batch.spot = false
}
// Enable Fusion
fusion.enabled = true
// Enable Wave container service
wave.enabled = true
// Enable Tower
tower.accessToken = System.getenv('TOWER_ACCESS_TOKEN')
// Enable Docker, required for container execution
docker.enabled = true
// Scope for Nextflow execution reports
report.enabled = true
timeline.enabled = true
trace.enabled = true
// Manifest info (optional)
manifest {
    name = 'pgscatalog/pgsc_calc'
    description = 'PGS Catalog Score Calculation pipeline'
    mainScript = 'main.nf'
}
I am able to get the pipeline to run if I use vcf files instead of pfiles for my imputed chromosomes. This does have the downside of being a bit slower, and not allowing for X chromosome files, but it works. If I include an X chromosome vcf file I get this error on the MAKE_COMPATIBLE step:
Error: chrX is present in the input file, but no sex information was provided; rerun this import with --psam or --update-sex. --split-par may also be appropriate.
I don't think this is a huge deal as most PGS don't use X chromosome alleles, but figured it's worth mentioning.
I'm still curious if anyone has successfully run the pipeline using pfiles as input on Google Cloud Batch or some other cloud executor?
chrX failing to recode is interesting! Thanks for flagging the problem. I'll try to reproduce and make a new issue.
I run plink2 files on Google Cloud Batch quite often. From the logs I noticed that your bucket contains cached work. My first thought is that sometimes the cache gets stuck in an invalid state if a process has previously failed. Changing the work directory to a different location in the bucket might help.
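As a rough sketch of that suggestion, each run could get its own work directory so that nothing cached from an earlier failed run is reused. The helper name `fresh_work_dir` and the bucket path in the example are placeholders, not anything taken from the pipeline.

```shell
# Build a work directory path that is unique per run, so a new run never
# reuses task directories left behind by an earlier failed run.
# The base path is a placeholder; substitute your own bucket.
fresh_work_dir() {
  base="${1%/}"   # strip one trailing slash, if present
  printf '%s/run-%s-%s\n' "$base" "$(date -u +%Y%m%dT%H%M%SZ)" "$$"
}

# Example (hypothetical bucket name):
#   work_dir="$(fresh_work_dir gs://my-bucket/nf-work)"
#   nextflow run pgscatalog/pgsc_calc ... -work-dir "$work_dir"
```

The timestamp plus shell PID keeps the path distinct even when two runs start in the same second from different shells.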
Thanks for the info. If you're referring to the "nf-work" directory that I set as the nextflow work-dir, I do run a script that clears all files in that directory in the bucket before each run of the pipeline.
Let me know if you need any info to reproduce the chrX error.
@nebfield do you see anything wrong with the plink2 files that this script generates? I just want to confirm there's no special formatting required for the pipeline. The SEX_FILE is a .txt file with two columns: '#IID' and 'SEX'. Thanks.
plink2 --vcf "$vcf_file" \
--update-sex "$SEX_FILE" \
--split-par "b37" \
--make-pgen \
--out "$base_name"
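For reference, a sex file of the shape described above (two tab-separated columns, `#IID` and `SEX`) could be written as follows. This is a minimal sketch for a single sample; `write_sex_file` is a made-up helper, and the sample ID and sex code are whatever applies to your data.

```shell
# Write a two-column, tab-separated sex file with the header '#IID SEX'.
# $1 = output path, $2 = sample ID, $3 = sex code (1 = male, 2 = female)
write_sex_file() {
  printf '#IID\tSEX\n%s\t%s\n' "$2" "$3" > "$1"
}

# Example:
#   write_sex_file sex.txt SAMPLE 1
#   SEX_FILE=sex.txt
```

The sample ID has to match the ID in the VCF header, otherwise plink2 has nothing to update.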
@alex1craig, that looks fine to me.
Hello everyone, I'm encountering the same error. Has anybody found a way to fix it? Thanks in advance.
@ibphd-99, you have to format the files with sex information. If your PGS doesn't have sex chromosome information in it (most don't) I would suggest removing the X chromosome from the samplesheet.
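If the samplesheet is a CSV, dropping the X chromosome rows could look something like the sketch below. It assumes a plain CSV (no quoted commas) whose header has a column named `chrom`; `drop_chrom` is a made-up helper, and a JSON samplesheet would need a different approach.

```shell
# Print a samplesheet CSV without the rows for one chromosome.
# $1 = samplesheet CSV, $2 = chromosome label to drop (e.g. X)
# Assumes the header row contains a column named 'chrom'; if it does not,
# the file is printed unchanged.
drop_chrom() {
  awk -F',' -v drop="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "chrom") col = i; print; next }
    col == 0 || $col != drop
  ' "$1"
}

# Example:
#   drop_chrom samplesheet.csv X > samplesheet_no_x.csv
```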