pgsc_calc

Possible bug: pipeline only calculates one score of many in --scorefile parameter

Open csjohnson23 opened this issue 1 year ago • 1 comments

Description of the bug

Hello!

It seems that running the pipeline with --scorefile path/to/scores/*.txt.gz results in only one score actually being calculated. In the score report, the nextflow command run correctly lists all 47 scores that match the wildcard path, but in the dataset_pgs.txt.gz output file, only the very first score is present.

Ultimately, I need to calculate >1000 scores, and I'm hoping to do so in parallel with pgsc_calc. I need to use the --scorefile parameter, as the pipeline fails to download scores on the cluster I'm using. Please let me know if there's a way around this, and thanks for your help!

_Note: I set min_overlap extremely low to make sure all scores are calculated._

Command used and terminal output

Command I ran: 

export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

module load nextflow
module load singularity/3.7.0

nextflow run pgscatalog/pgsc_calc -r main -latest \
    -profile singularity \
    -resume \
    --scorefile downloads/PGP000604/*.txt.gz \
    --genotypes_cache genotypes_cache \
    --input samplesheet.csv \
    --min_overlap 0.1 \
    --target_build GRCh37 \
    --run_ancestry ~/pgs_calc/pgsc_HGDP+1kGP_v1.tar.zst \
    -c ~/configs/pgs_calc_specific_slurm.config 

Command reported in the score report: 

nextflow run pgscatalog/pgsc_calc -r main -latest -profile singularity -resume \
    --scorefile downloads/PGP000604/PGS004701_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004705_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004707_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004711_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004715_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004719_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004723_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004727_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004731_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004733_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004737_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004739_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004743_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004747_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004751_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004755_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004759_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004761_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004765_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004767_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004769_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004771_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004775_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004779_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004783_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004785_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004789_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004791_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004795_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004797_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004801_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004805_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004809_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004811_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004815_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004817_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004821_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004825_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004829_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004833_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004835_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004837_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004841_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004845_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004851_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004853_hmPOS_GRCh37.txt.gz \
    downloads/PGP000604/PGS004855_hmPOS_GRCh37.txt.gz --genotypes_cache \
    genotypes_cache --input samplesheet.csv --min_overlap 0.1 --target_build GRCh37 \
    --run_ancestry /home/user/pgs_calc/pgsc_HGDP+1kGP_v1.tar.zst -c \
    /home/user/configs/pgs_calc_specific_slurm.config

Relevant files

Content of results/PROFILE2024/match/PROFILE2024_summary.csv:

dataset,accession,score_pass,match_status,ambiguous,is_multiallelic,duplicate_best_match,duplicate_ID,match_flipped,match_IDs,count,percent
PROFILE2024,PGS004701_hmPOS_GRCh37,true,matched,false,false,false,false,false,true,823218,74.30254881413667
PROFILE2024,PGS004701_hmPOS_GRCh37,true,unmatched,,,,,,,284709,25.69745118586333

Score report: failed_parallel_score_report.html.zip

System information

nextflow: nextflow/24.04.2.5914
Hardware: HPC
Executor: slurm
container engine: singularity
OS: CentOS 7
Version of pgsc_calc: 2.0.0-beta.2

csjohnson23 · Aug 07 '24 15:08

You need to wrap the path in double quotes (") when using a wildcard, e.g.:

--scorefile "downloads/PGP000604/*.txt.gz"

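Without quotes, your shell expands the wildcard into a list of separate file paths before the pipeline ever sees it, which stops multiple scoring files from being detected correctly. That is consistent with the score report listing every expanded path while only the first score is calculated.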
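To see the difference outside the pipeline, here is a minimal sketch that assumes nothing about pgsc_calc itself. It creates three throwaway files (the file names and the `count_args` helper are made up for this illustration) and counts how many arguments a command receives with and without quotes:

```shell
#!/usr/bin/env bash
# Sketch only: file names and count_args are hypothetical, not part of pgsc_calc.
set -eu

demo_dir=$(mktemp -d)
touch "$demo_dir/PGS_A.txt.gz" "$demo_dir/PGS_B.txt.gz" "$demo_dir/PGS_C.txt.gz"

# Helper: print how many arguments the called command actually receives.
count_args() { echo "$#"; }

# Unquoted glob: the shell expands it, so the command gets one argument per file.
unquoted_count=$(count_args --scorefile "$demo_dir"/*.txt.gz)

# Quoted glob: the pattern is passed through as a single literal argument.
quoted_count=$(count_args --scorefile "$demo_dir/*.txt.gz")

echo "unquoted: $unquoted_count arguments"
echo "quoted:   $quoted_count arguments"

rm -r "$demo_dir"
```

In the unquoted case the flag is followed by three separate paths, so a program that expects one value per flag has no single pattern left to work with. In the quoted case the pattern arrives intact and the receiving program can resolve it itself.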

nebfield · Aug 07 '24 15:08