pgsc_calc
Fail of run_ancestry process INTERSECT_THINNED
Description of the bug
I am hitting an error during the run_ancestry calculation. Any idea what it means?
PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:INTERSECT_THINNED (R01axy)'
Caused by:
Process PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:INTERSECT_THINNED (R01axy) terminated with an error exit status (3)
[...]
Command error:
gzip: gzip: R01axy_ALL_matched.txt.gz: No such file or directory
GRCh38_1000G_ALL_thinned.prune.in.gz: No such file or directory
Error: Failed to open GRCh38_R01axy_ALL.pgen : No such file or directory.
PLINK v2.00a3.3 SSE4.2 (3 Jun 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to extracted/GRCh38_R01axy_ALL_extracted.log.
Options in effect:
--allow-extra-chr
--chr 1-22
--extract R01axy_shared.txt
--make-pgen
--memory 16384
--out extracted/GRCh38_R01axy_ALL_extracted
--pfile GRCh38_R01axy_ALL vzs
--seed 31
--sort-vars
--threads 1
Command used and terminal output
sudo nextflow run pgscatalog/pgsc_calc -r v2.0.0-alpha.4 -profile docker -params-file nextflow_ancestry.yaml --max_memory "32GB"
Relevant files
System information
Ubuntu 22.04, Nextflow 23.10.1
Sorry you're having trouble. I don't normally see problems with this process. A few questions:
- Are you able to run the workflow OK without --run_ancestry?
- What does nextflow_ancestry.yaml contain?
- Does removing sudo do anything? nextflow should run as a normal user application
- If you're able to run the workflow OK without ancestry adjustment, could you try deleting the work directory and retrying ancestry adjustment? In some situations the cache can enter a bad state. Deleting helps reset it
Are you able to run the workflow OK without --run_ancestry? Yes.
What does nextflow_ancestry.yaml contain?
input: "sampleset_imputed.csv"
target_build: "GRCh38"
run_ancestry: "pgsc_1000G_v1.tar.zst"
pgs_id:"PGS002205,PGS002315"
Does removing sudo do anything? It gives a "Permission denied" error, so I need to run nextflow with sudo.
Could you try deleting the work directory and retrying ancestry adjustment? Yes, I already tried that and I still get the same error.
That's strange. The only thing that looks odd to me is sudo. Our tests never run nextflow as root - only on user accounts - so perhaps there's a problem hiding there somewhere.
I'm not sure what permissions problem you're having, but it should be quick to install nextflow on your user account. You may need to configure docker to run correctly on a non-root user as well.
Hi, I debugged the sudo. I needed to follow this troubleshooting guide: https://barrydigby.github.io/Introduction/Nextflow
Now my command runs at the user level:
nextflow run pgscatalog/pgsc_calc -r v2.0.0-alpha.4 -profile docker -params-file nextflow_ancestry.yaml --max_memory "32GB"
I am still getting the same error in the INTERSECT_THINNED step, so the permissions weren't the root cause.
Sorry, I'm not sure what's going wrong. Our test suite is passing and we're able to run the ancestry calculations in quite a few different environments. Tracking down the precise problem may be a little tricky.
Could you check if the work directory /data/work/8c/15b6db57aeab8b98ade3e3ff94d644 contains data that look sensible?
For example:
- What files are present in the directory?
- What happens if you check the input variants to the process?
  - wc -l <(gzcat R01axy_ALL_matched.txt.gz)
  - wc -l <(gzcat GRCh38_1000G_ALL_thinned.prune.in.gz)
  - wc -l R01axy_ALL_matched_thinned.txt
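The line-count checks suggested here can be wrapped in a small helper, sketched below. This is only an illustration and not part of pgsc_calc: `count_lines` is a hypothetical name, and it uses `gzip -dc` rather than `gzcat` on the assumption that `gzcat` may not be installed on every Linux system. Missing files (including broken symlinks, which is how staged inputs usually show up in a Nextflow work directory) are reported rather than treated as fatal.

```shell
#!/usr/bin/env bash
# count_lines FILE...: print "<file><TAB><line count>" for each file,
# decompressing *.gz files on the fly. A file that does not exist
# (or a symlink whose target is gone) is reported as MISSING.
count_lines() {
    local f n
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            printf '%s\tMISSING\n' "$f"
        elif [[ "$f" == *.gz ]]; then
            # tr strips the padding some wc implementations add
            n=$(gzip -dc -- "$f" | wc -l | tr -d '[:space:]')
            printf '%s\t%s\n' "$f" "$n"
        else
            n=$(wc -l < "$f" | tr -d '[:space:]')
            printf '%s\t%s\n' "$f" "$n"
        fi
    done
}
```

From inside the failing work directory it could be called as `count_lines R01axy_ALL_matched.txt.gz GRCh38_1000G_ALL_thinned.prune.in.gz R01axy_ALL_matched_thinned.txt`.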
@mireklzicar It also might be worth trying again with the latest release
@mireklzicar was there a resolution for this on your end? My run_ancestry process is failing at the same step (intersect_thinned) but confusingly, Command exit status: 0.
Apologies if I'm hijacking the thread, but if it's of interest, here are logs from trying before and after rebooting per https://github.com/PGScatalog/pgsc_calc/issues/155#issuecomment-1733329042. I verified that I am able to run docker run hello-world from my user.
nextflow_test30.log rebooted_nextflow_test30.log
The work directory only contains .command.(out/err/run/sh etc.) files and .exitcode
After running bash .command.run in the work directory,
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to extracted/GRCh37_testtirtya_ALL_vcf_extracted.log.
Options in effect:
--allow-extra-chr
--chr 1-22
--extract testtirtya_shared.txt
--make-pgen
--memory 24576
--out extracted/GRCh37_testtirtya_ALL_vcf_extracted
--pfile GRCh37_testtirtya_ALL_vcf vzs
--seed 31
--sort-vars
--threads 1
Start time: Tue Apr 2 15:06:03 2024
32082 MiB RAM detected; reserving 24576 MiB for main workspace.
Using 1 compute thread.
30 samples (0 females, 0 males, 30 ambiguous; 30 founders) loaded from
GRCh37_testtirtya_ALL_vcf.psam.
13593760 out of 14111439 variants loaded from
GRCh37_testtirtya_ALL_vcf.pvar.zst.
Note: No phenotype data present.
--extract: 56260 variants remaining.
56260 variants remaining after main filters.
Writing extracted/GRCh37_testtirtya_ALL_vcf_extracted.pvar ... done.
Writing extracted/GRCh37_testtirtya_ALL_vcf_extracted.psam ... done.
Writing extracted/GRCh37_testtirtya_ALL_vcf_extracted.pgen ... done.
End time: Tue Apr 2 15:06:15 2024
/home/ph/work/3d/bd2f44693958a032f266680ad59090/.command.sh: line 48: warning: here-document at line 45 delimited by end-of-file (wanted `END_VERSIONS')
ls: cannot access 'GRCh37_testtirtya_ALL_extracted.pgen': No such file or directory
ls: cannot access 'GRCh37_testtirtya_ALL_extracted.pvar.gz': No such file or directory
ls: cannot access 'GRCh37_testtirtya_ALL_extracted.psam': No such file or directory
is the output. There is no extracted directory (I think it is removed during the process) and, as the ls errors show, no GRCh37_testtirtya_ALL_extracted.p* files. I'd appreciate any thoughts on this.
I think you have to try and delete the work and cache and try again, it might be names that are colliding. Does your sampleset name (in samplesheet) have underscores in it? If so you should change them as they aren't allowed.
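A quick way to check a samplesheet for underscores in the sampleset name is sketched below. It assumes the samplesheet is a comma-separated file with a header row and the sampleset name in the first column; `find_underscore_samplesets` is a hypothetical helper, not a pgsc_calc command, so adjust the column if your samplesheet is laid out differently.

```shell
#!/usr/bin/env bash
# find_underscore_samplesets SAMPLESHEET: print each distinct value of the
# first CSV column (assumed to hold the sampleset name) that contains "_".
# The header row is skipped. Returns 1 if any such name is found, else 0.
find_underscore_samplesets() {
    awk -F',' '
        NR > 1 && $1 ~ /_/ {
            if (!seen[$1]++) print $1
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `find_underscore_samplesets samplesheet.csv && echo "no underscores found"` prints the confirmation only when every sampleset name is clean.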
@smlmbrt Thanks for the replies here and https://github.com/PGScatalog/pgsc_calc/issues/271#issuecomment-2032345030. I tried with a deleted work directory before, but not the cache, so I tried that and ran everything, only to stop at the same step with the same error (and command exit status).
The sampleset name also does not contain any underscores (it used to, which caused trouble, so we changed it).
@smlmbrt A quick follow-up: using v2.0.0-alpha.4 let the pipeline complete successfully. Running v2.0.0-alpha.5 with the same command leads to the aforementioned failure.
If there are any ways I can help troubleshoot, please let me know, thank you.
But when you ran v2.0.0-alpha.5 were you using an existing cache and shared work directory that might have also been used by v2.0.0-alpha.4? This can cause problems.
@smlmbrt A colleague and I encountered the same thing independently. After rolling back to a previous version the process runs through, but on the new version it fails with the "No such file or directory" error at the ancestry step.
Thanks all for the bug report, can I check whether your genotypes are split by chromosome or only a single combined file? And is it with a VCF or plink files? @nebfield might have other ideas.
Above:
[...]
Start time: Tue Apr 2 15:06:03 2024
32082 MiB RAM detected; reserving 24576 MiB for main workspace.
Using 1 compute thread.
30 samples (0 females, 0 males, 30 ambiguous; 30 founders) loaded from GRCh37_testtirtya_ALL_vcf.psam.
13593760 out of 14111439 variants loaded from GRCh37_testtirtya_ALL_vcf.pvar.zst.
Note: No phenotype data present.
--extract: 56260 variants remaining.
56260 variants remaining after main filters.
Writing extracted/GRCh37_testtirtya_ALL_vcf_extracted.pvar ... done.
Writing extracted/GRCh37_testtirtya_ALL_vcf_extracted.psam ... done.
Writing extracted/GRCh37_testtirtya_ALL_vcf_extracted.pgen ... done.
End time: Tue Apr 2 15:06:15 2024
/home/ph/work/3d/bd2f44693958a032f266680ad59090/.command.sh: line 48: warning: here-document at line 45 delimited by end-of-file (wanted `END_VERSIONS')
ls: cannot access 'GRCh37_testtirtya_ALL_extracted.pgen': No such file or directory
ls: cannot access 'GRCh37_testtirtya_ALL_extracted.pvar.gz': No such file or directory
ls: cannot access 'GRCh37_testtirtya_ALL_extracted.psam': No such file or directory
#282:
[...]
27904794 variants loaded from GRCh37_newautosomal_ALL_vcf.pvar.zst.
Note: No phenotype data present.
--extract: 59338 variants remaining.
59338 variants remaining after main filters.
Writing extracted/GRCh37_newautosomal_ALL_vcf_extracted.pvar ... done.
Writing extracted/GRCh37_newautosomal_ALL_vcf_extracted.psam ... done.
Writing extracted/GRCh37_newautosomal_ALL_vcf_extracted.pgen ... done.
/home/ubuntu/user/run/test/work/2e/d2a0b6356ba7f6f59e81f36d63d7cd/.command.sh: line 48: warning: here-document at line 45 delimited by end-of-file (wanted `END_VERSIONS')
ls: cannot access 'GRCh37_newautosomal_ALL_extracted.pgen': No such file or directory
ls: cannot access 'GRCh37_newautosomal_ALL_extracted.pvar.gz': No such file or directory
ls: cannot access 'GRCh37_newautosomal_ALL_extracted.psam': No such file or directory
Both are caused by vcf being included in the actual output file names (for example GRCh37_testtirtya_ALL_vcf_extracted.pgen) but not in the expected output names (GRCh37_testtirtya_ALL_extracted.pgen). The offending lines are: https://github.com/PGScatalog/pgsc_calc/blob/8bdf287d558a7abe1ef86e961337df71d7289d0d/modules/local/ancestry/oadp/intersect_thinned.nf#L76-L87
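To confirm this kind of mismatch in a work directory, something like the sketch below could help. It only illustrates the diagnosis and is not the fix applied in the workflow: `check_extracted` is a hypothetical helper, and the `PREFIX_extracted.pgen` naming is taken from the `ls` errors in the logs quoted in this thread.

```shell
#!/usr/bin/env bash
# check_extracted PREFIX [DIR]: compare the file name the process looks for
# (PREFIX_extracted.pgen) with what is actually present in DIR (default: .).
# Returns 0 if the expected file exists, 1 otherwise.
check_extracted() {
    local prefix=$1 dir=${2:-.}
    local expected="${prefix}_extracted.pgen"
    local f
    if [ -e "$dir/$expected" ]; then
        echo "found expected: $expected"
        return 0
    fi
    echo "missing expected: $expected"
    # List near-misses with an extra infix, such as PREFIX_vcf_extracted.pgen.
    for f in "$dir/${prefix}"_*_extracted.pgen; do
        if [ -e "$f" ]; then
            echo "found instead: ${f##*/}"
        fi
    done
    return 1
}
```

With the names from the first log, `check_extracted GRCh37_testtirtya_ALL extracted` would be the call to try, provided the extracted directory still exists when you inspect it.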
Our ancestry VCF test wasn't testing VCFs 👀 It was accidentally using plink2 files as input.
When I fixed the test it reproduced the reported error. After making some changes the test passes on the dev branch now.
If you'd like to test the latest changes you could try re-running your workflows with the parameter -r dev
We'll do a new patch release soon including these fixes 😄 Thanks for the bug reports everybody!
@nebfield running with -r dev fixed it for me. (But now, the file_pgs.txt.gz results only have the SUM column, and the other columns, Z_MostSimilarPop, Z_norm1, and Z_norm2, are empty. I'm testing if this is new in this version or if it is something else.)
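One way to quantify which columns are empty is sketched below. It assumes the results file is gzip-compressed, tab-separated, and has a header row; `count_empty_columns` is a hypothetical helper, and it counts only truly empty fields, so values such as NA or nan would not be counted.

```shell
#!/usr/bin/env bash
# count_empty_columns FILE.gz: for each column named in the header row,
# print "<column><TAB><number of data rows where that field is empty>".
count_empty_columns() {
    gzip -dc -- "$1" | awk -F'\t' '
        NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
        { for (i = 1; i <= n; i++) if ($i == "") empty[i]++ }
        END { for (i = 1; i <= n; i++) printf "%s\t%d\n", name[i], empty[i] }
    '
}
```

Running `count_empty_columns file_pgs.txt.gz` on the output of both versions would show whether the empty Z_MostSimilarPop, Z_norm1, and Z_norm2 columns are specific to one of them.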