APPLY_SCORE Failing on Google Cloud Batch

Open alex1craig opened this issue 6 months ago • 7 comments

Description of the bug

I am trying to run pgsc_calc on Google Cloud Batch to score chromosome files that I imputed from an ancestry.com report. Most of the pipeline is running successfully, but it fails on APPLY_SCORE:

Process `PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:PLINK2_SCORE (895a1bc3-77e6-4a28-858d-fc5d38c877e9 chromosome 12 effect type additive 0)` terminated with an error exit status (6)

The issue seems to be caused by the .psam files that the pipeline generates not being accessible:

INFO:   Error: No samples in GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12.psam.
Fusion Info:
    fusion_version: 2.4.11-8ead802
    clone_namespace: false
    kernel_version: 6.6
    disk_cache_size: 368Gb
    max_open_files: 1048576

This is the executed command that causes the error:

INFO:   plink2 \
    --threads 2 \
    --memory 8192 \
    --seed 31 \
    --extract 895a1bc3-77e6-4a28-858d-fc5d38c877e9_12_additive_0.scorefile.gz \
    --allow-extra-chr \
    --score 895a1bc3-77e6-4a28-858d-fc5d38c877e9_12_additive_0.scorefile.gz zs header-read cols=+scoresums,+denom,+fid list-variants no-mean-imputation \
    --error-on-freq-calc \
    --score-col-nums 3-6 \
    --pfile vzs GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12 \
    --out 895a1bc3-77e6-4a28-858d-fc5d38c877e9_12_additive_0
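
One way to narrow this down is to check, from inside the failing task's work directory, that every file behind `--pfile vzs` is present and non-empty. The sketch below assumes the usual plink2 naming for that mode (`PREFIX.pgen`, `PREFIX.pvar.zst`, `PREFIX.psam`); `check_pfile` is a hypothetical helper name, not something pgsc_calc or plink2 provides.

```shell
# Sketch: report which parts of a plink2 fileset are missing or empty.
# Assumes `--pfile vzs PREFIX` reads PREFIX.pgen, PREFIX.pvar.zst and PREFIX.psam.
# check_pfile is a hypothetical helper name.
check_pfile() {
    local prefix="$1" status=0 f
    for f in "$prefix.pgen" "$prefix.pvar.zst" "$prefix.psam"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"      # file was never staged into this directory
            status=1
        elif [ ! -s "$f" ]; then
            echo "empty: $f"        # file exists but has zero bytes
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_pfile GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12` prints one line per problem file and returns non-zero if anything is missing or empty. A "missing" result would match the "No such file or directory" message, while an "empty" result would be consistent with "No samples".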

Some things I have tried:

  • I thought the issue might be with fusion, so I tried disabling it and rerunning the pipeline. I was met with a similar error: Error: Failed to open GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12.psam : No such file or directory. However, in this case the command exit status was 3 instead of 6.

  • I experimented with the format of my original .psam files. I tried these two formats, and neither made a difference:

    Format 1:

    #IID	SEX
    SAMPLE	1
    

    Format 2:

    #FID	IID	SEX
    895a1bc3-77e6-4a28-858d-fc5d38c877e9	SAMPLE	1
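
Since both layouts above are header plus one sample row, a quick count of the sample rows in the .psam that the pipeline actually generates can help separate a formatting problem from an empty or truncated file. This is only a sketch: it treats every non-empty line that does not start with `#` as a sample, and `psam_sample_count` is a hypothetical helper name.

```shell
# Sketch: count sample rows in a .psam file.
# Any non-empty line that does not start with '#' is counted as one sample.
# psam_sample_count is a hypothetical helper name.
psam_sample_count() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

Running `psam_sample_count GRCh37_895a1bc3-77e6-4a28-858d-fc5d38c877e9_12.psam` in the task directory should print 1 for a single-sample file; a result of 0 would be consistent with the "No samples" error.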
    

I understand pgsc_calc may run into issues with imputed chromosome files due to lack of WGS support, but I am able to successfully run the pipeline on the same chromosome files on a local Linux machine. So the issue seems to come from the cloud executor rather than from my imputed chromosome files.

Any help would be appreciated, thanks.

Command used and terminal output

nextflow run pgscatalog/pgsc_calc \
    -profile docker \
    -c nextflow.config \
    --input "$samplesheet_path" \
    --target_build GRCh37 \
    --pgs_id "$pgs_ids" \
    -work-dir "$work_dir" \
    --format json

Relevant files

batch-logs.json

samplesheet.json

nextflow.log

System information

nextflow.config:

// Google Cloud Batch configuration for Nextflow
process {
    // Define the executor
    executor = 'google-batch'

    // Define the container image using an environment variable
    // Fallback to a generic gcloud image if not set
    container = System.getenv('CONTAINER_IMAGE') ?: 'gcr.io/google-containers/google-cloud-cli:latest'

    cpus = 7
    memory = '28.GB'
    time = '24.h'

    // Error strategy for potential preemptions (exit code 50001 for GCE Spot VM preemption via Batch)
    errorStrategy = { task.exitStatus == 50001 ? 'retry' : 'terminate' }
    maxRetries = 3
}

// Google Cloud specific settings
google {
    // Project ID and Location (Region) obtained from environment variables
    project = System.getenv('PROJECT_ID')
    location = System.getenv('GCP_REGION')

    batch.spot = false
}

// Enable Fusion
fusion.enabled = true
// Enable Wave container service
wave.enabled = true
// Enable Tower
tower.accessToken = System.getenv('TOWER_ACCESS_TOKEN')

// Enable Docker, required for container execution
docker.enabled = true

// Scope for Nextflow execution reports
report.enabled = true
timeline.enabled = true
trace.enabled = true

// Manifest info (optional)
manifest {
    name = 'pgscatalog/pgsc_calc'
    description = 'PGS Catalog Score Calculation pipeline'
    mainScript = 'main.nf'
} 

alex1craig, May 05 '25 13:05

I am able to get the pipeline to run if I use vcf files instead of pfiles for my imputed chromosomes. This does have the downside of being a bit slower, and not allowing for X chromosome files, but it works. If I include an X chromosome vcf file I get this error on the MAKE_COMPATIBLE step:

Error: chrX is present in the input file, but no sex information was provided; rerun this import with --psam or --update-sex.  --split-par may also be appropriate

I don't think this is a huge deal as most PGS don't use X chromosome alleles, but figured it's worth mentioning.

I'm still curious if anyone has successfully run the pipeline using pfiles as input on Google Cloud Batch or some other cloud executor?

alex1craig, May 05 '25 20:05

chrX failing to recode is interesting! Thanks for flagging the problem. I'll try to reproduce and make a new issue.

I run plink2 files on Google Cloud Batch quite often. From the logs I noticed that your bucket contains cached work. My first thought is that sometimes the cache gets stuck in an invalid state if a process has previously failed. Changing the work directory to a different location in the bucket might help.
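
A minimal sketch of that suggestion: generate a fresh work directory path for every run, so no cached task state can be picked up. The bucket name `my-bucket` and the `nf-work` prefix are placeholders, and `new_work_dir` is a hypothetical helper name.

```shell
# Sketch: build a unique work directory path per run.
# "gs://my-bucket/nf-work" is a placeholder; substitute your own bucket and prefix.
# new_work_dir is a hypothetical helper name.
new_work_dir() {
    local base="$1"
    # Strip a trailing slash, then append a UTC timestamp and the shell PID.
    printf '%s/run-%s-%s\n' "${base%/}" "$(date -u +%Y%m%dT%H%M%SZ)" "$$"
}

work_dir="$(new_work_dir "gs://my-bucket/nf-work")"
echo "$work_dir"
```

The resulting `$work_dir` can then be passed to `-work-dir "$work_dir"` in the `nextflow run` command.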

nebfield, May 06 '25 15:05

Thanks for the info. If you're referring to the "nf-work" directory that I set as the nextflow work-dir, I do run a script that clears all files in that directory in the bucket before each run of the pipeline.

Let me know if you need any info to reproduce the chrX error.

alex1craig, May 06 '25 21:05

@nebfield do you see anything wrong with the plink2 files that this script generates? I just want to confirm there's no special formatting required for the pipeline. The SEX_FILE is a .txt file with two columns: '#IID' and 'SEX'. Thanks.

plink2 --vcf "$vcf_file" \
       --update-sex "$SEX_FILE" \
       --split-par "b37" \
       --make-pgen \
       --out "$base_name"
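
For reference, a sex file with the two columns described above could be written as follows. This is a sketch, not the script used in this thread: `write_sex_file` is a hypothetical helper, and it assumes plink2's usual coding of 1 for male and 2 for female.

```shell
# Sketch: write a tab-separated sex file with '#IID' and 'SEX' columns.
# Usage: write_sex_file OUTPUT IID1 SEX1 [IID2 SEX2 ...]
# Assumes SEX is coded 1 = male, 2 = female. write_sex_file is a hypothetical helper name.
write_sex_file() {
    local out="$1"
    shift
    printf '#IID\tSEX\n' > "$out"
    # Consume the remaining arguments two at a time as IID/SEX pairs.
    while [ "$#" -ge 2 ]; do
        printf '%s\t%s\n' "$1" "$2" >> "$out"
        shift 2
    done
}
```

For the single-sample case here, `write_sex_file sex.txt SAMPLE 1` produces a header line followed by one row, and the path can be used as `$SEX_FILE` in the command above.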

alex1craig, May 10 '25 17:05

@alex1craig, that looks fine to me.

smlmbrt, May 13 '25 08:05

I am able to get the pipeline to run if I use vcf files instead of pfiles for my imputed chromosomes. This does have the downside of being a bit slower, and not allowing for X chromosome files, but it works. If I include an X chromosome vcf file I get this error on the MAKE_COMPATIBLE step:

Error: chrX is present in the input file, but no sex information was provided; rerun this import with --psam or --update-sex.  --split-par may also be appropriate

I don't think this is a huge deal as most PGS don't use X chromosome alleles, but figured it's worth mentioning.

I'm still curious if anyone has successfully run the pipeline using pfiles as input on Google Cloud Batch or some other cloud executor?

Hello to everyone, I encounter the same error... Has somebody found a way to fix it? Thanks in advance.

ibphd-99, May 21 '25 13:05

@ibphd-99, you have to format the files with sex information. If your PGS doesn't have sex chromosome information in it (most don't) I would suggest removing the X chromosome from the samplesheet.
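
Dropping the X chromosome can be scripted. The sketch below assumes a CSV samplesheet with a header row containing a `chrom` column (the samplesheet attached to this issue is JSON, so it would need a different tool); `drop_chrom` is a hypothetical helper name, and the simple comma split assumes no field contains a quoted comma.

```shell
# Sketch: print a CSV samplesheet without the rows for one chromosome.
# Usage: drop_chrom SAMPLESHEET.csv CHROM
# Assumes a header row with a "chrom" column and no quoted commas in any field.
# drop_chrom is a hypothetical helper name.
drop_chrom() {
    awk -F, -v drop="$2" '
        # Header: locate the chrom column, then pass the header through.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "chrom") col = i; print; next }
        # Data rows: keep everything if no chrom column was found, else skip matches.
        col == 0 || $col != drop { print }
    ' "$1"
}
```

For example, `drop_chrom samplesheet.csv X > samplesheet_noX.csv` writes a copy without the chromosome X row, which can then be passed to `--input`.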

smlmbrt, May 21 '25 13:05