genome-grist
Errors if genome-grist run on Marine metagenomes
There were a few errors that happened during a genome-grist run on Marine metagenomes
(`less ~/assloss/grist/marine21/jobs/grist.j56313129.err`):
- SRR9178284, errors in rules `samtools_count_wc`, `bam_to_depth_wc`, `bam_to_fastq_wc`:
```
...
[Thu Nov 24 07:14:11 2022]
Error in rule samtools_count_wc:
    jobid: 27756
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
    shell:
        samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Activating conda environment: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
Removing output files of failed job samtools_count_wc since they might be corrupted:
outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
Job failed, going on with independent jobs.
...
Error in rule bam_to_depth_wc:
    jobid: 26167
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
Job failed, going on with independent jobs.
    shell:
        samtools depth -aa outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
...
Error in rule bam_to_fastq_wc:
    jobid: 28547
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
    shell:
        samtools bam2fq outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam | gzip > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```
- Error in rule `download_matching_genome_wc`:
```
...
downloading genome for ident GCF_000014265.1/Trichodesmium erythraeum IMS101 from NCBI...
[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz

[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz

RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 523, in open
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 561, in error
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
...
```
- SRR11922358, error in rule `make_mapping_notebook_wc`:
```
...
Error in rule make_mapping_notebook_wc:
    jobid: 90
    output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:
        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb -k genome_grist -p sample_id SRR11922358 -p render '' -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb --to html --stdout --no-input --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR11922358.html
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```
- SRR13449930, error in rule `make_mapping_notebook_wc`:
```
...
Error in rule make_mapping_notebook_wc:
    jobid: 167
    output: outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb, outputs.marine21_samples/reports/report-mapping-SRR13449930.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:
        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb -k genome_grist -p sample_id SRR13449930 -p render '' -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb --to html --stdout --no-input --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR13449930.html
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```
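As an aside, the repeated note "(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)" refers to `set -euo pipefail`: with `pipefail`, a failure anywhere in a pipeline fails the whole shell block, so e.g. a samtools error before the `| gzip` still fails the job. A minimal illustration (plain Python driving bash, not genome-grist code):

```python
import subprocess

# Strict mode: pipefail makes the failing `false` fail the pipeline, and
# `set -e` then aborts the script before the final echo can run.
strict = subprocess.run(
    ["bash", "-c", "set -euo pipefail; false | gzip > /dev/null; echo reached"],
    capture_output=True, text=True,
)

# Without strict mode, only the last command's status (gzip, success)
# matters, so the script "succeeds" despite the failure in the pipeline.
lax = subprocess.run(
    ["bash", "-c", "false | gzip > /dev/null; echo reached"],
    capture_output=True, text=True,
)

print(strict.returncode, repr(strict.stdout))  # non-zero exit, no output
print(lax.returncode, repr(lax.stdout))
```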
I am getting the same error: grist cannot download a specific genome. In my case it is GCF_006715245.1.
When I checked the status of the genome on the [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/), the genome is missing!
There I found a message saying that the assembly status is suppressed, so it makes sense that it fails. My suggestion is to add a line to handle that error. I will try to put some code.
This is the relevant section in the Snakefile:

```python
# download actual genomes from genbank!
rule download_matching_genome_wc:
    input:
        csvfile = ancient(f'{GENBANK_CACHE}/{{ident}}.info.csv')
    output:
        genome = f"{GENBANK_CACHE}/{{ident}}_genomic.fna.gz"
    run:
        rows = list(load_csv(input.csvfile))
        assert len(rows) == 1
        row = rows[0]

        ident = row['ident']
        assert wildcards.ident.startswith(ident)
        url = row['genome_url']
        name = row['display_name']

        print(f"downloading genome for ident {ident}/{name} from NCBI...",
              file=sys.stderr)
        with open(output.genome, 'wb') as outfp:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                outfp.write(content)
                print(f"...wrote {len(content)} bytes to {output.genome}",
                      file=sys.stderr)
```
My solution: change the Snakefile, starting at line 1062, from:
```python
with open(output.genome, 'wb') as outfp:
    with urllib.request.urlopen(url) as response:
        content = response.read()
        outfp.write(content)
        print(f"...wrote {len(content)} bytes to {output.genome}",
              file=sys.stderr)
```
to:
```python
with open(output.genome, 'wb') as outfp:
    try:
        with urllib.request.urlopen(url) as response:
            content = response.read()
            outfp.write(content)
            print(f"...wrote {len(content)} bytes to {output.genome}",
                  file=sys.stderr)
    except urllib.error.HTTPError:
        # e.g. HTTP Error 404 for a suppressed/removed assembly;
        # a bare `except:` would also swallow unrelated failures
        print(f"Genome not found for {ident}/{name}, skipping it",
              file=sys.stderr)
```
another genome missing: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/205/GCF_000020205.1_ASM2020v1/
> ```
> ...
> [Thu Nov 24 07:14:11 2022]
> Error in rule samtools_count_wc:
>     jobid: 27756
>     output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
>     conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
>     shell:
>         samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
>         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
> ```
In this one, it's hard to know what the error is because it occurred above the copy/paste - can you try rerunning with the target `outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt` in place of any of the other targets (`gather_reads` or `summarize_mapping` or whatnot)?
> ```
> Error in rule make_mapping_notebook_wc:
>     jobid: 90
>     output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html
> ```
I think I fixed this one in https://github.com/dib-lab/genome-grist/pull/242 which is now released in v0.9.1! So if you `pip install -U genome-grist` it should run!
> I am getting the same error, grist cannot download a specific genome. In my case is GCF_006715245.1
Thanks @carden24! I have some ideas here - I don't want to just ignore the missing genomes... more in a bit.
Here's one way I'm thinking of supporting "missing" genomes - https://github.com/dib-lab/genome-grist/pull/255
I like the idea of requiring that they be added manually (or at least that manual acknowledgement be made).
A different or additional approach would be to suggest downloading them or a replacement manually and making it part of a private database.
#255 is maturing. I'd be interested in your thoughts @carden24 @jeanzzhao
I spent some time trying to get the data from other sources; I could not get it from GenBank or GOLD, but it is available from the JGI portal. I assume that this will not be the case for the other genomes, so I am thinking that an alternative is to get another closely related genome, based maybe on ANI or some other measure of genome similarity.
Usually the genome has been removed for a good reason. I would probably go use GTDB or NCBI taxonomy to find another genome from the same species.
Yes, totally agree - the criteria for removal from NCBI can vary and there is no way to know programmatically.
I've just released genome-grist v0.9.2; `pip install -U genome-grist` should upgrade.

This includes `skip_genomes` - from the configuration page:
```yaml
# skip_genomes: identifiers to ignore when they show up in gather output.
# This is useful when the sourmash database contains genomes that are no
# longer present in GenBank because they have been deprecated or suppressed.
#
# Note, in such cases you should try to find a new genome to include in
# a local database!
#
# DEFAULT: []
skip_genomes: []
```
You can use something like:

```yaml
skip_genomes:
  - GCF_000020205.1
```

to give it a try.
I upgraded grist to 0.9.2 and ran it again, but snakemake is failing because it expects the genome to have been downloaded, as required in the rule output. I used the skip_genomes option in the config file and it was read successfully, but snakemake cannot handle the missing output.
```
[Tue Dec 6 08:40:45 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_006715245.1.info.csv
    output: genbank_cache/GCF_006715245.1_genomic.fna.gz
    jobid: 144
    reason: Missing output files: genbank_cache/GCF_006715245.1_genomic.fna.gz
    wildcards: ident=GCF_006715245.1
    resources: tmpdir=/tmp

samples: ['Mock_T0_3_S3', 'Mock_T0_2_S2', 'Mock_T0_1_S1']
outdir: grist
base_tempdir: /tmp/tmpf8o7ziq2
['GCF_006715245.1']
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Select jobs to execute...
downloading genome for ident GCF_006715245.1/Bacillus sp. SLBN-3 from NCBI...
Cannot download genome from URL: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/GCF_006715245.1_ASM671524v1_genomic.fna.gz
Is it missing? If so, consider adding 'GCF_006715245.1' to 'skip_genomes' list in config file.
[Tue Dec 6 08:40:45 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    input: genbank_cache/GCF_006715245.1.info.csv
    output: genbank_cache/GCF_006715245.1_genomic.fna.gz

RuleException: Exception in line 1077 of /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
Genbank genome not found
  File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1077, in __rule_download_matching_genome_wc
  File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job download_matching_genome_wc since they might be corrupted:
genbank_cache/GCF_006715245.1_genomic.fna.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-12-06T084040.101718.snakemake.log
```
hi @carden24 just to confirm, did you add it to `skip_genomes` in the config file?

```yaml
skip_genomes:
  - GCF_006715245.1
```
Should I remove files like `GCF_000173095.1.info.csv` under `genbank_cache/` before starting the re-run?
Yes on the config file, but I think I needed to clean files before rerunning. Originally I ran genome-grist, and once it failed because of the missing genome, I added the skip_genomes option and tried to rerun. I have now removed all the output folders and now it works. It seems like the IGNORE_IDENT variable is used in earlier steps, and that is why it kept looking for those genomes. Thanks a lot for the help.
@carden24 could you remind me which files you cleaned? thanks
I removed the genbank_cache folder, the gather one, and the sig one too. Not sure if all of them were required.
hmm, that's interesting 😓 it should be downstream of those, although removing them will certainly force recalculation of everything downstream! @jeanzzhao wait a few and I'll see if I can figure out something more precise!
Whoops, looks like I messed up the `skip_genomes` code in #255 - I needed to add it in one more place. Working on a fix in https://github.com/dib-lab/genome-grist/pull/259. Apologies!
Merged #259 and released genome-grist v0.9.3. Please give it a try: `pip install -U genome-grist`

(you shouldn't need to remove or edit any files to get this to work, @jeanzzhao)
- `pip install -U genome-grist`, v0.9.3, did not remove any previous file, sbatch #58672657, failed
```
'/home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log'
Building DAG of jobs...
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
[... ~50 more alternating "Updating job" lines trimmed ...]
Using shell: /bin/bash
Provided cores: 11
Rules claiming more threads will be scaled down.
Job stats:
job                                 count    min threads    max threads
--------------------------------    -------  -------------  -------------
copy_sample_genomes_to_output_wc    19       1              1
download_matching_genome_wc         8        1              1
make_combined_info_csv_wc           19       1              1
make_gather_notebook_wc             19       1              1
summarize_gather                    1        1              1
total                               66       1              1
Select jobs to execute...

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472565.1.info.csv
    output: genbank_cache/GCF_000472565.1_genomic.fna.gz
    jobid: 935
    wildcards: ident=GCF_000472565.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000504225.1.info.csv
    output: genbank_cache/GCF_000504225.1_genomic.fna.gz
    jobid: 995
    wildcards: ident=GCF_000504225.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000020205.1.info.csv
    output: genbank_cache/GCF_000020205.1_genomic.fna.gz
    jobid: 1003
    wildcards: ident=GCF_000020205.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472605.1.info.csv
    output: genbank_cache/GCF_000472605.1_genomic.fna.gz
    jobid: 983
    wildcards: ident=GCF_000472605.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000701385.1.info.csv
    output: genbank_cache/GCF_000701385.1_genomic.fna.gz
    jobid: 1737
    wildcards: ident=GCF_000701385.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000173095.1.info.csv
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz
    jobid: 1469
    wildcards: ident=GCF_000173095.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000597705.1.info.csv
    output: genbank_cache/GCF_000597705.1_genomic.fna.gz
    jobid: 2329
    wildcards: ident=GCF_000597705.1
    resources: tmpdir=/tmp

[Thu Dec 8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000014265.1.info.csv
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz
    jobid: 2919
    wildcards: ident=GCF_000014265.1
    resources: tmpdir=/tmp

Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log
```
I am getting an error at the make_gather_notebook_wc step. I ran it with a simple sample.
```
Error in rule make_gather_notebook_wc:
    jobid: 1
    input: /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb, grist/gather/Mock_T0_3_S3.gather.csv.gz, grist/gather/Mock_T0_3_S3.genomes.info.csv, grist/.kernel.set
    output: grist/reports/report-gather-Mock_T0_3_S3.ipynb, grist/reports/report-gather-Mock_T0_3_S3.html
    conda-env: /home/mixtures/erick_dev/GC_Test/Test4/.snakemake/conda/3661d3423026d9d473032c65ccc8aec6_
    shell:
        papermill /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb grist/reports/report-gather-Mock_T0_3_S3.ipynb -k genome_grist -p sample_id Mock_T0_3_S3 -p render '' -p outdir grist --cwd grist/reports/
        python -m nbconvert grist/reports/report-gather-Mock_T0_3_S3.ipynb --to html --stdout --no-input --ExecutePreprocessor.kernel_name=genome_grist > grist/reports/report-gather-Mock_T0_3_S3.html
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job make_gather_notebook_wc since they might be corrupted:
grist/reports/report-gather-Mock_T0_3_S3.ipynb
```
This is the folder structure of the grist folder:

```
grist
├── gather
│   ├── Mock_T0_3_S3.gather.csv.gz
│   ├── Mock_T0_3_S3.gather.out
│   ├── Mock_T0_3_S3.genomes.info.csv
│   ├── Mock_T0_3_S3.known.sig.zip
│   ├── Mock_T0_3_S3.matches.sig.zip
│   ├── Mock_T0_3_S3.prefetch.csv.gz
│   └── Mock_T0_3_S3.unknown.sig.zip
├── genomes
│   ├── GCF_000009045.1_genomic.fna.gz
│   ├── GCF_000009045.1.info.csv
│   ├── GCF_000012905.2_genomic.fna.gz
│   ├── GCF_000012905.2.info.csv
│   ├── GCF_000219605.1_genomic.fna.gz
│   ├── GCF_000219605.1.info.csv
│   ├── GCF_000238915.1_genomic.fna.gz
│   ├── GCF_000238915.1.info.csv
│   ├── GCF_000368145.1_genomic.fna.gz
│   ├── GCF_000368145.1.info.csv
│   ├── GCF_000368685.1_genomic.fna.gz
│   ├── GCF_000368685.1.info.csv
│   ├── GCF_001042485.2_genomic.fna.gz
│   ├── GCF_001042485.2.info.csv
│   ├── GCF_001646745.1_genomic.fna.gz
│   ├── GCF_001646745.1.info.csv
│   ├── GCF_900215245.1_genomic.fna.gz
│   └── GCF_900215245.1.info.csv
├── raw
│   ├── Mock_T0_3_S3_1.fastq.gz
│   └── Mock_T0_3_S3_2.fastq.gz
├── sigs
│   └── Mock_T0_3_S3.trim.sig.zip
└── trim
    ├── Mock_T0_3_S3.trim.fq.gz
    ├── Mock_T0_3_S3.trim.html
    └── Mock_T0_3_S3.trim.json
```
> `pip install -U genome-grist`, v0.9.3, did not remove any previous file, sbatch #58672657, failed
Hi Jean, I took a look at `~assloss/grist/marine44/` and tried running one of your samples as below - so far it's working. I wonder if you "just" need to add more skip_genomes? It's annoying to figure out, I know... I'll seek additional solutions!
```yaml
samples:
  - SRR5915428
outdir: outputs.jean/
sourmash_databases:
  - gtdb-rs207.genomic.k31.zip
skip_genomes:
  - GCF_000472605.1
  - GCF_000504225.1
```
Hi Titus,

- I realized that I did not have `rs207` in the folder when I changed `conf.yml` to `rs207`.
- `curl -L https://osf.io/w4bcm/download -o gtdb-rs207.genomic-reps.k31.sbt.zip`
- re-ran, sbatch #58926209, failed after ~9h with a different error, in rule `make_combined_info_csv_wc`; refer to this for details: https://hackmd.io/DOWP1qUzTCqdihYOSyp5Zg?view#12922

-Jean
I am still having issues with this error. I think that there are still some rules that need to incorporate a check to ignore genomes that cannot be downloaded.
These rules correctly ignore the missing genome specified in the yaml:

- `download_matching_genome`
- `make_genbank_info_csv`
- `bam_to_depth_wc`
- `minimap_wc`
- `samtools_mpileup_wc`
- `samtools_count_wc`
- `bam_to_fastq_wc`
The first rule that creates an error is extract_leftover_reads_wc. I checked its code: it uses the gather CSV file as input, but it does not check for the flagged genomes in the Python script substract_gather.py.
```python
input:
    csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
    mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),
```
These other rules also use that CSV as input: `make_gather_notebook_wc` (uses papermill and report-gather.ipynb) and `make_mapping_notebook_wc` (uses papermill and report-mapping.ipynb).
A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) as an argument to the Python script, and check it when loading the list of genomes from the CSV. Starting at line 29:
```python
with gzip.open(args.gather_csv, "rt") as fp:
    r = csv.DictReader(fp)
    for row in r:
        rows.append(row)
    print(f"...loaded {len(rows)} results total.")

print("checking input/output pairs:")
pairs = []
fail = False
for row in rows:
    acc = row["name"].split()[0]

    # >>> added: skip genomes flagged in the config
    if acc in IGNORE_IDENTS:
        print(f"ignoring {acc}")
        continue
    # <<<

    filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
    overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
    leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"

    if not os.path.exists(filename):
        print(f"ERROR: input filename (unknown) does not exist. Will exit.")
        fail = True

    pairs.append((acc, filename, overlapping, leftover))
```
I don't know enough about python notebooks to suggest a solution there.
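For reference, the filtering loop proposed above could be sketched as a small standalone helper, which makes it easy to check the skip behavior in isolation (`build_pairs` and its signature are hypothetical names for illustration, not the actual substract_gather.py code):

```python
def build_pairs(rows, sample_id, outdir, ignore_idents=frozenset()):
    """Build (acc, mapped, overlap, leftover) path tuples from gather rows,
    skipping any accession flagged in ignore_idents (i.e. skip_genomes)."""
    pairs = []
    for row in rows:
        # gather rows carry the accession as the first word of "name"
        acc = row["name"].split()[0]
        if acc in ignore_idents:
            print(f"ignoring {acc} (in skip_genomes)")
            continue
        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        pairs.append((acc, filename, overlapping, leftover))
    return pairs
```

With this shape, the existence check on `filename` would only ever run for genomes that were not skipped.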