pgsc_calc
Out of memory error in MAKE_COMPATIBLE:PLINK2_VCF step when processing Illumina WGS gVCF-derived VCF
Description of the bug
I'm running pgsc_calc on an Illumina WGS sample (on a server with 1 TB of RAM). The VCF was produced using GATK HaplotypeCaller (gVCF mode) followed by GenotypeGVCFs:
java -jar /home/tools/gatk-4.6.2.0/gatk-package-4.6.2.0-local.jar GenotypeGVCFs -R Homo_sapiens_assembly38_noalt.fasta -V X2121.snps.raw.g.vcf.gz -O X2121.raw.vcf.gz --dbsnp dbSNP155_fixed.vcf.gz --include-non-variant-sites true
The command that I've used for pgsc_calc is:
nextflow run pgscatalog/pgsc_calc \
-profile docker \
--input samplesheet.csv \
--pgs_id PGS001931 \
--efo_id EFO_0009695 \
--target_build GRCh38 \
--liftover \
--hg19_chain https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz \
--hg38_chain https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz \
--run-ancestry https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_HGDP+1kGP_v1.tar.zst \
-c custom.config
My custom.config is:
process {
executor = 'local'
withName: 'PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF' {
memory = '300.GB'
cpus = 16
time = '48.h'
}
}
I'm obtaining this error:
ERROR ~ Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF (X2121 chromosome ALL)'
Caused by: Process terminated with an error exit status (2)
Command executed:
plink2
--threads 16
--memory 307200
--set-all-var-ids '@:#:$r:$a'
--max-alleles 2
--freq
--missing vcols=fmissdosage,fmiss
--new-id-max-allele-len 100 missing
--vcf X2121.raw.vcf
--allow-extra-chr --chr 1-22, X, Y, XY
--make-pgen vzs pvar-cols="-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm"
--out GRCh38_X2121_ALL
Command exit status: 2
Output: Out of memory.
Thanks in advance!!!
Command used and terminal output
Relevant files
No response
System information
No response
I'd suggest converting your VCF into plink2 files before using the workflow. Your VCF file seems complicated and big!
See here for more details. In particular, it would probably help to split your VCF so that each chromosome has its own file. You can do that by running:
plink2 --vcf <full_path_to_vcf.vcf.gz> \
--allow-extra-chr \
--chr <chromosome> \
--make-pgen --out <short_name>_<chromosome>
for each chromosome.
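For example (a sketch, not a pipeline helper: the input path and the X2121_chr output prefix are placeholders to adapt to your data), a bash loop over all chromosomes could look like:

# Split a single genome-wide VCF into one plink2 fileset per chromosome.
# /path/to/X2121.raw.vcf.gz and the X2121_chr prefix are illustrative names only.
for chrom in {1..22} X Y; do
    plink2 --vcf /path/to/X2121.raw.vcf.gz \
        --allow-extra-chr \
        --chr "${chrom}" \
        --make-pgen \
        --out "X2121_chr${chrom}"
done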
Thanks for your suggestion, but I'm still having problems. The PLINK2_SCORE process (APPLY_SCORE) terminates with exit status 2 due to an out-of-memory error, even though sufficient system RAM is available and the job is allocated 8 GB. This occurs when running the pipeline on chromosome-split WGS PLINK2 files.
Dataset Characteristics
Input format: PLINK2 files (.pgen, .psam, .pvar) pre-split by chromosome
For example, for chr1:
Sites on chr1: 248,956,435
File sizes (chr1): .pgen 713 MB, .pvar 6.4 GB (uncompressed ASCII), .psam 30 bytes (1 sample)
Target build: GRCh38
PGS scores: 3 (PGS001931, PGS002148, PGS003516)
Custom Configuration File Used
// nextflow.config
docker {
enabled = true
runOptions = '--user root'
}
process {
// Global retry strategy
errorStrategy = { task.exitStatus in [130,131,134,135,137,139,140,141,143,145] ? 'retry' : 'finish' }
maxRetries = 3
// Configuration for MATCH_VARIANTS (heavy process)
withName: 'PGSCATALOG_PGSCCALC:PGSCCALC:MATCH:MATCH_VARIANTS' {
// Start from 64 GB, scaled automatically on each retry
memory = { 64.GB * task.attempt }
time = { 6.h * task.attempt }
cpus = 2
// Set the Docker container memory limit
containerOptions = '--user root --memory=256g'
}
// Configuration for PLINK2_RELABELPVAR (medium process)
withName: 'PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR' {
memory = { 64.GB * task.attempt }
time = { 4.h * task.attempt }
cpus = 2
containerOptions = '--user root --memory=128g'
}
// General PLINK2 processes (catch-all for any PLINK2 process)
withLabel: 'process_medium' {
memory = { 64.GB * task.attempt }
cpus = 2
time = { 4.h * task.attempt }
containerOptions = '--user root'
}
}
params {
// Recommended global limits so the total RAM is not saturated
max_memory = '750.GB'
max_cpus = 16
max_time = '240.h'
}
Samplesheet
sampleset,path_prefix,chrom,format
X2121-axy,/home/user/prova_pgs/X2121_axy_chr1,1,pfile
X2121-axy,/home/user/prova_pgs/X2121_axy_chr2,2,pfile
[...chr 3-22, X, Y...]
Command Used
nextflow run pgscatalog/pgsc_calc \
-profile docker \
-c custom.config \
-resume \
--input samplesheet.csv \
--pgs_id PGS001931,PGS002148,PGS003516 \
--efo_id EFO_0009695 \
--target_build GRCh38 \
--min_overlap 0.5 \
--liftover \
--hg19_chain https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz \
--hg38_chain https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz \
--run-ancestry https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_HGDP+1kGP_v1.tar.zst \
--outdir results
The error in .command.err is:
Error: Out of memory. The --memory flag may be helpful.
Let me know if the complete error log would be helpful!
Thanks in advance for your help!!!
8 GB of RAM isn't enough for a big target genome.
You can allocate more memory to the APPLY_SCORE stage by changing your configuration:
process {
withName: 'APPLY_SCORE' {
cpus = 1
memory = 16.GB
time = 6.hour
}
}
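A minimal way to apply this (a sketch): add that withName block to the custom.config you are already passing with -c, then re-run with -resume so cached steps are reused and only the failed scoring step runs again; keep the other options (--liftover, the chain files, --run-ancestry, --min_overlap) exactly as in your original command so the cache still matches:

# Re-run after editing custom.config; -resume reuses cached work.
nextflow run pgscatalog/pgsc_calc \
    -profile docker \
    -c custom.config \
    -resume \
    --input samplesheet.csv \
    --pgs_id PGS001931,PGS002148,PGS003516 \
    --efo_id EFO_0009695 \
    --target_build GRCh38 \
    --outdir results
    # (other options from your original command go here unchanged)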