
No cleaning with exact matches even when lineage is provided

Open taylorreiter opened this issue 3 years ago • 5 comments

When running charcoal to clean contigs, I noticed:

python -m charcoal run inputs/charcoal-conf.yml all_clean_contigs
filter rank is none; not doing any cleaning.
wrote 2171793 clean bp to /home/tereiter/github/2020-ibd/outputs/charcoal/GCA_900758465.1_genomic.fna.gz.clean.fa.gz

This stems from the stage1_hitlist.csv file, which reports that no lineage was provided:

GCF_008121495.1_genomic.fna.gz,none,,0,0,0,0,0,0,0,1.000,1.000,,"found exact match: GCF_008121495.1 [Ruminococcus] gnavus ATCC 29149 strain=JCM6515, ASM812149v1. but no provided lineage! cannot analyze."
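To see how many genomes are affected, the hit list can be scanned for rows like the one above. This is only a minimal sketch: it assumes the second column holds the filter rank and the last column holds the comment, which matches the row shown here but is not taken from charcoal's documentation. The function name `rows_without_filter_rank` is hypothetical.

```python
import csv
import io


def rows_without_filter_rank(hitlist_file):
    """Yield (genome, comment) for rows whose second column is 'none'.

    Assumption: column 2 is the filter rank and the last column is the
    free-text comment, as in the row quoted in this issue.
    """
    for row in csv.reader(hitlist_file):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        genome, filter_rank, comment = row[0], row[1], row[-1]
        if filter_rank.strip().lower() == "none":
            yield genome, comment


# The row from this issue, used as in-memory sample input.
sample = io.StringIO(
    'GCF_008121495.1_genomic.fna.gz,none,,0,0,0,0,0,0,0,1.000,1.000,,'
    '"found exact match: GCF_008121495.1 [Ruminococcus] gnavus ATCC 29149 '
    'strain=JCM6515, ASM812149v1. but no provided lineage! cannot analyze."\n'
)

flagged = list(rows_without_filter_rank(sample))
for genome, comment in flagged:
    print(genome, "->", comment)
```

To run it on the real file, replace `sample` with `open(path, newline="")` pointing at the hit list CSV.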

However, I do provide lineages in my config file:

# location for all generated files
output_dir: /home/tereiter/github/2020-ibd/outputs/charcoal/
# list of genome filenames to decontaminate
genome_list: /home/tereiter/github/2020-ibd/outputs/charcoal_conf/charcoal.genome-list.txt
# directory in which genome filenames live
genome_dir: /home/tereiter/github/2020-ibd/genbank_genomes
# (optional) list of lineages for input genomes. comment out or leave
# blank if none.
provided_lineages: /home/tereiter/github/2020-ibd/outputs/genbank/gather_vita_vars_gtdb_shared_assemblies.x.genbank.lineages.csv
# match_rank is the rank _above_ which cross-lineage matches are considered
# contamination. e.g. if set to 'superkingdom', then Archaeal matches in
# Bacterial genomes will be contamination, but nothing else.
#
# values can be superkingdom, phylum, class, order, family, or genus.
match_rank: order
# sourmash query databases for contamination (SBTs, LCAs, or signatures)
gather_db:
 - /group/ctbrowngrp/gtdb/databases/gtdb-rs202.genomic.k31.zip
# lineages CSV (see `sourmash lca index`) for signatures in query databases
lineages_csv: /group/ctbrowngrp/gtdb/gtdb-rs202.taxonomy.csv
strict: 1

When I removed the stage1/*.hitlist_for_filtering.csv files, rule compare_taxonomy_single was re-run, and I think this is where the problem originates. It loads the 54 provided lineages but doesn't do anything with them.

$ python -m charcoal run inputs/charcoal-conf.yml all_clean_contigs -j 1
** read 54 provided lineages
** config file checks PASSED!
** from here on out, it's all snakemake...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all_clean_contigs
        54      clean_contigs
        1       combine_hit_list
        52      compare_taxonomy_single
        108
Select jobs to execute...

[Wed Jul  7 09:44:56 2021]
rule compare_taxonomy_single:
    input: /home/tereiter/github/2020-ibd/outputs/charcoal/stage1/GCF_003479605.1_genomic.fna.gz.contigs-tax.json, /home/tereiter/github/2020-ibd/outputs/charcoal/stage1/GCF_003479605.1_genomic.fna.gz.matches.zip, /group/c$browngrp/gtdb/gtdb-rs202.taxonomy.csv, /home/tereiter/github/2020-ibd/outputs/genbank/gather_vita_vars_gtdb_shared_assemblies.x.genbank.lineages.csv, /home/tereiter/github/2020-ibd/outputs/charcoal_conf/charcoal.genome-lis$.txt
    output: /home/tereiter/github/2020-ibd/outputs/charcoal/stage1/GCF_003479605.1_genomic.fna.gz.hitlist_for_filtering.csv, /home/tereiter/github/2020-ibd/outputs/charcoal/stage1/GCF_003479605.1_genomic.fna.gz.genome_summ$ry.csv, /home/tereiter/github/2020-ibd/outputs/charcoal/stage1/GCF_003479605.1_genomic.fna.gz.contam_summary.json
    jobid: 155
    wildcards: g=GCF_003479605.1_genomic.fna.gz

Activating conda environment: /home/tereiter/github/2020-ibd/.snakemake/conda/11b6d58027966e25866faeb8fd08fa61
examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
loaded 258406 tax assignments.
working on GCF_003479605.1_genomic.fna.gz
loaded 54 provided lineages
found exact match: GCF_003479605.1 Roseburia sp. AF12-17LB strain=AF12-17LB, ASM347960v1. but no provided lineage!
examining GCF_003479605.1_genomic.fna.gz for contamination
   superkingdom: 0 contigs w/ 0kb
   phylum: 0 contigs w/ 0kb
   class: 0 contigs w/ 0kb
   order: 0 contigs w/ 0kb
   family: 0 contigs w/ 0kb
   genus: 0 contigs w/ 0kb
   (total): 0 contigs w/ 0kb
processed GCF_003479605.1_genomic.fna.gz.
saving contamination summary to /home/tereiter/github/2020-ibd/outputs/charcoal/stage1/GCF_003479605.1_genomic.fna.gz.contam_summary.json
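Since the log reports "loaded 54 provided lineages" and then "no provided lineage!" for the same genome, one thing worth ruling out is an identifier mismatch between the genome list and the provided lineages file (for example, a bare accession in one and the full filename in the other). I have not confirmed that this is the cause. The sketch below assumes the first column of the lineages CSV is the genome identifier; the function name and the sample lineage row are hypothetical.

```python
import csv
import io


def genomes_missing_lineage(genome_names, lineage_csv_file):
    """Return genome names with no matching first-column entry in the CSV.

    Assumption: the first column of each row identifies the genome and
    must match the genome list entry exactly.
    """
    provided = {row[0].strip() for row in csv.reader(lineage_csv_file) if row}
    return [name for name in genome_names if name.strip() not in provided]


# Genome name taken from the log above.
genome_list = ["GCF_003479605.1_genomic.fna.gz"]

# Hypothetical lineage row keyed by bare accession rather than filename.
lineages = io.StringIO("GCF_003479605.1,d__Bacteria,p__Firmicutes_A\n")

missing = genomes_missing_lineage(genome_list, lineages)
print(missing)
```

If the list printed for the real files is non-empty, the identifiers in the two files do not line up and the lineages would be loaded without ever being matched to a genome.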

Charcoal version/installation environment:

This is charcoal version v0.1.dev324+g66e6635

Package install path: /home/tereiter/github/2020-ibd/.snakemake/conda/59cc953d/lib/python3.9/site-packages/charcoal
Install-wide config file: /home/tereiter/github/2020-ibd/.snakemake/conda/59cc953d/lib/python3.9/site-packages/charcoal/conf/system.conf
snakemake Snakefile: /home/tereiter/github/2020-ibd/.snakemake/conda/59cc953d/lib/python3.9/site-packages/charcoal/Snakefile

taylorreiter, Jul 07 '21 16:07