genome-grist
Errors if genome-grist run on Marine metagenomes
There were a few errors that happened during a genome-grist run on Marine metagenomes
(`less ~/assloss/grist/marine21/jobs/grist.j56313129.err`):
- SRR9178284, errors in rules `samtools_count_wc`, `bam_to_depth_wc`, `bam_to_fastq_wc`:
```
...
[Thu Nov 24 07:14:11 2022]
Error in rule samtools_count_wc:
    jobid: 27756
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
    shell:
        samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Activating conda environment: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
Removing output files of failed job samtools_count_wc since they might be corrupted:
outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
Job failed, going on with independent jobs.
...
Error in rule bam_to_depth_wc:
    jobid: 26167
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
Job failed, going on with independent jobs.
    shell:
        samtools depth -aa outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
...
Error in rule bam_to_fastq_wc:
    jobid: 28547
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
    shell:
        samtools bam2fq outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam | gzip > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```
- Error in rule `download_matching_genome_wc`:
```
...
downloading genome for ident GCF_000014265.1/Trichodesmium erythraeum IMS101 from NCBI...
[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz

[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz

RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 523, in open
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 561, in error
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
...
```
- SRR11922358, error in rule `make_mapping_notebook_wc`:
```
...
Error in rule make_mapping_notebook_wc:
    jobid: 90
    output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:
        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb -k genome_grist -p sample_id SRR11922358 -p render '' -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb --to html --stdout --no-input --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR11922358.html
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```
- SRR13449930, error in rule `make_mapping_notebook_wc`:
```
...
Error in rule make_mapping_notebook_wc:
    jobid: 167
    output: outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb, outputs.marine21_samples/reports/report-mapping-SRR13449930.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:
        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb -k genome_grist -p sample_id SRR13449930 -p render '' -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb --to html --stdout --no-input --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR13449930.html
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```
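As an aside, the repeated note "(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)" refers to `set -euo pipefail`: with `pipefail`, a failure anywhere in a pipeline fails the whole shell block, so e.g. a samtools error before the `| gzip` still fails the job. A minimal illustration (plain Python driving bash, not genome-grist code):

```python
import subprocess

# Strict mode: pipefail makes the failing `false` fail the pipeline, and
# `set -e` then aborts the script before the final echo can run.
strict = subprocess.run(
    ["bash", "-c", "set -euo pipefail; false | gzip > /dev/null; echo reached"],
    capture_output=True, text=True,
)

# Without strict mode, only the last command's status (gzip, success)
# matters, so the script "succeeds" despite the failure in the pipeline.
lax = subprocess.run(
    ["bash", "-c", "false | gzip > /dev/null; echo reached"],
    capture_output=True, text=True,
)

print(strict.returncode, repr(strict.stdout))  # non-zero exit, no output
print(lax.returncode, repr(lax.stdout))
```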
I am getting the same error: grist cannot download a specific genome. In my case it is GCF_006715245.1.
When I checked the status of the genome on the [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/), the genome is missing!
There I found a message saying that the assembly status is suppressed, so it makes sense that it fails. My suggestion is to add a line to handle that error. I will try to put some code.
This is the relevant section in the Snakefile:

```python
# download actual genomes from genbank!
rule download_matching_genome_wc:
    input:
        csvfile = ancient(f'{GENBANK_CACHE}/{{ident}}.info.csv')
    output:
        genome = f"{GENBANK_CACHE}/{{ident}}_genomic.fna.gz"
    run:
        rows = list(load_csv(input.csvfile))
        assert len(rows) == 1
        row = rows[0]

        ident = row['ident']
        assert wildcards.ident.startswith(ident)
        url = row['genome_url']
        name = row['display_name']

        print(f"downloading genome for ident {ident}/{name} from NCBI...",
              file=sys.stderr)
        with open(output.genome, 'wb') as outfp:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                outfp.write(content)
                print(f"...wrote {len(content)} bytes to {output.genome}",
                      file=sys.stderr)
```
My solution: change the Snakefile, starting at line 1062, from:
```python
with open(output.genome, 'wb') as outfp:
    with urllib.request.urlopen(url) as response:
        content = response.read()
        outfp.write(content)
        print(f"...wrote {len(content)} bytes to {output.genome}",
              file=sys.stderr)
```
to:
```python
with open(output.genome, 'wb') as outfp:
    try:
        with urllib.request.urlopen(url) as response:
            content = response.read()
            outfp.write(content)
            print(f"...wrote {len(content)} bytes to {output.genome}",
                  file=sys.stderr)
    except urllib.error.HTTPError:
        # e.g. HTTP Error 404 for a suppressed/removed assembly;
        # a bare `except:` would also swallow unrelated failures
        print(f"Genome not found for {ident}/{name}, skipping it",
              file=sys.stderr)
```
another genome missing: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/205/GCF_000020205.1_ASM2020v1/
> ```
> ...
> [Thu Nov 24 07:14:11 2022]
> Error in rule samtools_count_wc:
>     jobid: 27756
>     output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
>     conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
>     shell:
>         samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
>         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
> ```
In this one, it's hard to know what the error is because it occurred above the copy/paste - can you try rerunning with the target `outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt` in place of any of the other targets (`gather_reads` or `summarize_mapping` or whatnot)?
> ```
> Error in rule make_mapping_notebook_wc:
>     jobid: 90
>     output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html
> ```
I think I fixed this one in https://github.com/dib-lab/genome-grist/pull/242 which is now released in v0.9.1! So if you `pip install -U genome-grist` it should run!
> I am getting the same error, grist cannot download a specific genome. In my case is GCF_006715245.1
Thanks @carden24! I have some ideas here - I don't want to just ignore the missing genomes... more in a bit.
Here's one way I'm thinking of supporting "missing" genomes - https://github.com/dib-lab/genome-grist/pull/255
I like the idea of requiring that they be added manually (or at least that manual acknowledgement be made).
A different or additional approach would be to suggest downloading them or a replacement manually and making it part of a private database.
#255 is maturing. I'd be interested in your thoughts @carden24 @jeanzzhao
I spent some time trying to get the data from other sources; I could not get it from GenBank or GOLD, but it is available from the JGI portal. I assume that this will not be the case for the other genomes, so I am thinking that an alternative is to get another closely related genome, based maybe on ANI or some other measure of genome similarity.
Usually the genome has been removed for a good reason. I would probably go use GTDB or NCBI taxonomy to find another genome from the same species.
Yes, totally agree - the criteria for removal from NCBI can vary and there is no way to know programmatically.
I've just released genome-grist v0.9.2; `pip install -U genome-grist` should upgrade.

This includes `skip_genomes` - from the configuration page:
```yaml
# skip_genomes: identifiers to ignore when they show up in gather output.
# This is useful when the sourmash database contains genomes that are no
# longer present in GenBank because they have been deprecated or suppressed.
#
# Note, in such cases you should try to find a new genome to include in
# a local database!
#
# DEFAULT: []
skip_genomes: []
```
You can use something like:

```yaml
skip_genomes:
  - GCF_000020205.1
```

to give it a try.
I upgraded grist to 0.9.2 and ran it again, but snakemake is failing because it expects the genome to have been downloaded, as required in the rule output. I used the skip_genomes option in the config file and it was read successfully, but snakemake cannot handle the missing output.
```
[Tue Dec 6 08:40:45 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_006715245.1.info.csv
    output: genbank_cache/GCF_006715245.1_genomic.fna.gz
    jobid: 144
    reason: Missing output files: genbank_cache/GCF_006715245.1_genomic.fna.gz
    wildcards: ident=GCF_006715245.1
    resources: tmpdir=/tmp

samples: ['Mock_T0_3_S3', 'Mock_T0_2_S2', 'Mock_T0_1_S1']
outdir: grist
base_tempdir: /tmp/tmpf8o7ziq2
['GCF_006715245.1']
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Select jobs to execute...
downloading genome for ident GCF_006715245.1/Bacillus sp. SLBN-3 from NCBI...
Cannot download genome from URL: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/GCF_006715245.1_ASM671524v1_genomic.fna.gz
Is it missing? If so, consider adding 'GCF_006715245.1' to 'skip_genomes' list in config file.
[Tue Dec 6 08:40:45 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    input: genbank_cache/GCF_006715245.1.info.csv
    output: genbank_cache/GCF_006715245.1_genomic.fna.gz

RuleException: Exception in line 1077 of /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
Genbank genome not found
  File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1077, in __rule_download_matching_genome_wc
  File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job download_matching_genome_wc since they might be corrupted:
genbank_cache/GCF_006715245.1_genomic.fna.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-12-06T084040.101718.snakemake.log
```
hi @carden24 just to confirm, did you add it to `skip_genomes` in the config file?

```yaml
skip_genomes:
  - GCF_006715245.1
```
Should I remove files like `GCF_000173095.1.info.csv` under `genbank_cache/` before starting the re-run?
Yes on the config file, but I think I needed to clean files before rerunning. Originally I ran genome-grist, and once it failed because of the missing genome, I added the skip_genomes option and tried to rerun. I have now removed all the output folders and now it works. It seems like the IGNORE_IDENT variable is used in earlier steps, and that is why it kept looking for those genomes. Thanks a lot for the help.
@carden24 could you remind me which files you cleaned? thanks
I removed the genbank_cache folder, the gather one, and the sig one too. Not sure if all of them were required.
hmm, that's interesting 😓 it should be downstream of those, although removing them will certainly force recalculation of everything downstream! @jeanzzhao wait a few and I'll see if I can figure out something more precise!
Whoops, looks like I messed up the `skip_genomes` code in #255 - I needed to add it in one more place. Working on a fix in https://github.com/dib-lab/genome-grist/pull/259. Apologies!
Merged #259 and released genome-grist v0.9.3. Please give it a try: `pip install -U genome-grist`

(you shouldn't need to remove or edit any files to get this to work, @jeanzzhao)
- `pip install -U genome-grist`, v0.9.3, did not remove any previous file, sbatch #58672657, failed
```
'/home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log'
Building DAG of jobs...
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
[... ~50 more alternating "Updating job" lines trimmed ...]
Using shell: /bin/bash
Provided cores: 11
Rules claiming more threads will be scaled down.
Job stats:
job                                 count    min threads    max threads
--------------------------------    -------  -------------  -------------
copy_sample_genomes_to_output_wc    19       1              1
download_matching_genome_wc         8        1              1
make_combined_info_csv_wc           19       1              1
make_gather_notebook_wc             19       1              1
summarize_gather                    1        1              1
total                               66       1              1
Select jobs to execute...

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472565.1.info.csv
    output: genbank_cache/GCF_000472565.1_genomic.fna.gz
    jobid: 935
    wildcards: ident=GCF_000472565.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000504225.1.info.csv
    output: genbank_cache/GCF_000504225.1_genomic.fna.gz
    jobid: 995
    wildcards: ident=GCF_000504225.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000020205.1.info.csv
    output: genbank_cache/GCF_000020205.1_genomic.fna.gz
    jobid: 1003
    wildcards: ident=GCF_000020205.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472605.1.info.csv
    output: genbank_cache/GCF_000472605.1_genomic.fna.gz
    jobid: 983
    wildcards: ident=GCF_000472605.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000701385.1.info.csv
    output: genbank_cache/GCF_000701385.1_genomic.fna.gz
    jobid: 1737
    wildcards: ident=GCF_000701385.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000173095.1.info.csv
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz
    jobid: 1469
    wildcards: ident=GCF_000173095.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000597705.1.info.csv
    output: genbank_cache/GCF_000597705.1_genomic.fna.gz
    jobid: 2329
    wildcards: ident=GCF_000597705.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000014265.1.info.csv
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz
    jobid: 2919
    wildcards: ident=GCF_000014265.1
    resources: tmpdir=/tmp

Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log
```
I am getting an error at the make_gather_notebook_wc step. I ran it with a simple sample.
```
Error in rule make_gather_notebook_wc:
    jobid: 1
    input: /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb, grist/gather/Mock_T0_3_S3.gather.csv.gz, grist/gather/Mock_T0_3_S3.genomes.info.csv, grist/.kernel.set
    output: grist/reports/report-gather-Mock_T0_3_S3.ipynb, grist/reports/report-gather-Mock_T0_3_S3.html
    conda-env: /home/mixtures/erick_dev/GC_Test/Test4/.snakemake/conda/3661d3423026d9d473032c65ccc8aec6_
    shell:
        papermill /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb grist/reports/report-gather-Mock_T0_3_S3.ipynb -k genome_grist -p sample_id Mock_T0_3_S3 -p render '' -p outdir grist --cwd grist/reports/
        python -m nbconvert grist/reports/report-gather-Mock_T0_3_S3.ipynb --to html --stdout --no-input --ExecutePreprocessor.kernel_name=genome_grist > grist/reports/report-gather-Mock_T0_3_S3.html
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job make_gather_notebook_wc since they might be corrupted:
grist/reports/report-gather-Mock_T0_3_S3.ipynb
```
This is the folder structure of the grist folder:

```
grist
├── gather
│   ├── Mock_T0_3_S3.gather.csv.gz
│   ├── Mock_T0_3_S3.gather.out
│   ├── Mock_T0_3_S3.genomes.info.csv
│   ├── Mock_T0_3_S3.known.sig.zip
│   ├── Mock_T0_3_S3.matches.sig.zip
│   ├── Mock_T0_3_S3.prefetch.csv.gz
│   └── Mock_T0_3_S3.unknown.sig.zip
├── genomes
│   ├── GCF_000009045.1_genomic.fna.gz
│   ├── GCF_000009045.1.info.csv
│   ├── GCF_000012905.2_genomic.fna.gz
│   ├── GCF_000012905.2.info.csv
│   ├── GCF_000219605.1_genomic.fna.gz
│   ├── GCF_000219605.1.info.csv
│   ├── GCF_000238915.1_genomic.fna.gz
│   ├── GCF_000238915.1.info.csv
│   ├── GCF_000368145.1_genomic.fna.gz
│   ├── GCF_000368145.1.info.csv
│   ├── GCF_000368685.1_genomic.fna.gz
│   ├── GCF_000368685.1.info.csv
│   ├── GCF_001042485.2_genomic.fna.gz
│   ├── GCF_001042485.2.info.csv
│   ├── GCF_001646745.1_genomic.fna.gz
│   ├── GCF_001646745.1.info.csv
│   ├── GCF_900215245.1_genomic.fna.gz
│   └── GCF_900215245.1.info.csv
├── raw
│   ├── Mock_T0_3_S3_1.fastq.gz
│   └── Mock_T0_3_S3_2.fastq.gz
├── sigs
│   └── Mock_T0_3_S3.trim.sig.zip
└── trim
    ├── Mock_T0_3_S3.trim.fq.gz
    ├── Mock_T0_3_S3.trim.html
    └── Mock_T0_3_S3.trim.json
```
> `pip install -U genome-grist`, v0.9.3, did not remove any previous file, sbatch #58672657, failed
Hi Jean, I took a look at `~assloss/grist/marine44/` and tried running one of your samples as below - so far it's working. I wonder if you "just" need to add more skip_genomes? It's annoying to figure out, I know... I'll seek additional solutions!
```yaml
samples:
  - SRR5915428
outdir: outputs.jean/
sourmash_databases:
  - gtdb-rs207.genomic.k31.zip
skip_genomes:
  - GCF_000472605.1
  - GCF_000504225.1
```
Hi Titus,

- I realized that I did not have `rs207` in the folder when I changed `conf.yml` to `rs207`.
- `curl -L https://osf.io/w4bcm/download -o gtdb-rs207.genomic-reps.k31.sbt.zip`
- re-ran, sbatch #58926209, failed after ~9h with a different error, in rule `make_combined_info_csv_wc`; refer to this for details: https://hackmd.io/DOWP1qUzTCqdihYOSyp5Zg?view#12922

-Jean
I am still having issues with this error. I think that there are still some rules that need to incorporate a check to ignore genomes that cannot be downloaded.
These rules correctly ignore the missing genome specified in the yaml:

- `download_matching_genome`
- `make_genbank_info_csv`
- `bam_to_depth_wc`
- `minimap_wc`
- `samtools_mpileup_wc`
- `samtools_count_wc`
- `bam_to_fastq_wc`
The first rule that creates an error is extract_leftover_reads_wc. I checked its code: it uses the gather CSV file as input, but it does not check for the flagged genomes in the Python script substract_gather.py.
```python
input:
    csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
    mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),
```
These other rules also use that CSV as input: `make_gather_notebook_wc` (uses papermill and report-gather.ipynb) and `make_mapping_notebook_wc` (uses papermill and report-mapping.ipynb).
A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) as an argument to the Python script, and check it when loading the list of genomes from the CSV. Starting at line 29:
```python
with gzip.open(args.gather_csv, "rt") as fp:
    r = csv.DictReader(fp)
    for row in r:
        rows.append(row)
    print(f"...loaded {len(rows)} results total.")

print("checking input/output pairs:")
pairs = []
fail = False
for row in rows:
    acc = row["name"].split()[0]

    # >>> added: skip genomes flagged in the config
    if acc in IGNORE_IDENTS:
        print(f"ignoring {acc}")
        continue
    # <<<

    filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
    overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
    leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"

    if not os.path.exists(filename):
        print(f"ERROR: input filename (unknown) does not exist. Will exit.")
        fail = True

    pairs.append((acc, filename, overlapping, leftover))
```
I don't know enough about python notebooks to suggest a solution there.
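For reference, the filtering loop proposed above could be sketched as a small standalone helper, which makes it easy to check the skip behavior in isolation (`build_pairs` and its signature are hypothetical names for illustration, not the actual substract_gather.py code):

```python
def build_pairs(rows, sample_id, outdir, ignore_idents=frozenset()):
    """Build (acc, mapped, overlap, leftover) path tuples from gather rows,
    skipping any accession flagged in ignore_idents (i.e. skip_genomes)."""
    pairs = []
    for row in rows:
        # gather rows carry the accession as the first word of "name"
        acc = row["name"].split()[0]
        if acc in ignore_idents:
            print(f"ignoring {acc} (in skip_genomes)")
            continue
        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        pairs.append((acc, filename, overlapping, leftover))
    return pairs
```

With this shape, the existence check on `filename` would only ever run for genomes that were not skipped.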