anvio icon indicating copy to clipboard operation
anvio copied to clipboard

[BUG] anvi-run-hmms silently fails when multithreaded

Open dspeth opened this issue 4 years ago • 43 comments

Short description of the problem

anvi-run-hmms silently fails when multithreaded, and run on a single (or low number of) contigs

anvi'o version

Anvi'o .......................................: hope (v7)

Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1

System info

Linux version 4.9.0-13-amd64 ([email protected]) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.228-1 (2020-07-05)

Detailed description of the issue

Hello anvio team, i hope this bug is something you have already encountered and fixed in the master branch, since it concerns HMMs, and I did notice you have spent quite a bit of energy on handling them recently. However, I could not find an issue specifically referencing this problem, so I'll open a new one.

I was building a phylogenomic tree using the steps described in the anvio phylogenomic workflow. My target was a recently published gammaproteobacterial endosymbiont ( https://www.ncbi.nlm.nih.gov/nuccore/LR794158), and so I was curious what the alignment for this specific organism looked like. However, a quick grep of the concatenated protein fastafile did not yield anything, which was somewhat surprising. Some other contigs databases, which were in multiple contigs, seemed to be fine.

(Note: a potential side issue is that despite it's many warnings anvi-get-sequences-for-hmm-hits does not warn you when one of your contigs dbs does not have any hits to the HMM collection of choice, it juts omits them from the final alignment)

some sleuthing ensued, and I realized I had used anvi-run-hmms with the -T 50 flag. (Did I really use 50 cores to run HMMs on a 300kb genome you ask? Yes I did, and so I maybe deserved the ensuing mess for being wasteful). Taking the contigs db and running anvi-run-hmms yielded a decreasing number of hits to the Bacteria_71 HMM collection, with increasing number of cores, hitting 0 somewhere between 40-50, as detailed by the pasted output of the number of hits to the collection below. Since this seemed to behave fine for genomes in more pieces (no systematic testing done), I think it might have something to do with the partitioning of a single sequence during the multithreading.

Hope this helps! Daan

anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified, https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpd645k8uj/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp2lasbim2
Log file for thread 0 ........................: /tmp/tmp2lasbim2/AA_gene_sequences.fa.0_log
Number of raw hits ...........................: 63                                                                                                                                                             
Number of weak hits removed ..................: 0                                                                                                                                                              
Number of hits in annotation dict  ...........: 63
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 2

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpph05hta4/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp_louykze
Log file for thread 0 ........................: /tmp/tmp_louykze/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp_louykze/AA_gene_sequences.fa.1_log
Number of raw hits ...........................: 21                                                 
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 21
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 3

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp7_5b1ouh/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp6lvjpvem
Log file for thread 0 ........................: /tmp/tmp6lvjpvem/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp6lvjpvem/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmp6lvjpvem/AA_gene_sequences.fa.2_log
Number of raw hits ...........................: 17                                                 
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 17
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 4

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp_a8836k8/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmpq0xwm06_
Log file for thread 0 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.3_log
Number of raw hits ...........................: 10                                                 
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 10
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 5

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpknk0_xq0/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmplb4oin0s
Log file for thread 0 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.4_log
Number of raw hits ...........................: 8                                                  
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 8
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 10

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp0ue00x54/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmpw0a4b9ro
Log file for thread 0 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.9_log
Number of raw hits ...........................: 8                                                  
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 8
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 20

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmptsqxt8yh/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmpbihm5bh0
Log file for thread 0 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.9_log
Log file for thread 10 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.10_log
Log file for thread 11 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.11_log
Log file for thread 12 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.12_log
Log file for thread 13 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.13_log
Log file for thread 14 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.14_log
Log file for thread 15 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.15_log
Log file for thread 16 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.16_log
Log file for thread 17 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.17_log
Log file for thread 18 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.18_log
Log file for thread 19 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.19_log
Number of raw hits ...........................: 2                                                  
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 2
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 40

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpf0j2rpes/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp9s9t5sql
Log file for thread 0 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.9_log
Log file for thread 10 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.10_log
Log file for thread 11 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.11_log
Log file for thread 12 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.12_log
Log file for thread 13 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.13_log
Log file for thread 14 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.14_log
Log file for thread 15 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.15_log
Log file for thread 16 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.16_log
Log file for thread 17 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.17_log
Log file for thread 18 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.18_log
Log file for thread 19 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.19_log
Log file for thread 20 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.20_log
Log file for thread 21 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.21_log
Log file for thread 22 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.22_log
Log file for thread 23 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.23_log
Log file for thread 24 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.24_log
Log file for thread 25 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.25_log
Log file for thread 26 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.26_log
Log file for thread 27 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.27_log
Log file for thread 28 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.28_log
Log file for thread 29 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.29_log
Log file for thread 30 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.30_log
Log file for thread 31 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.31_log
Log file for thread 32 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.32_log
Log file for thread 33 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.33_log
Log file for thread 34 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.34_log
Log file for thread 35 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.35_log
Log file for thread 36 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.36_log
Log file for thread 37 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.37_log
Log file for thread 38 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.38_log
Log file for thread 39 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.39_log
Number of raw hits ...........................: 1                                                  
Number of weak hits removed ..................: 0                                                  
Number of hits in annotation dict  ...........: 1

anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 50

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
                                                https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp_muvxx5r/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp33y1315u
Log file for thread 0 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.9_log
Log file for thread 10 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.10_log
Log file for thread 11 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.11_log
Log file for thread 12 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.12_log
Log file for thread 13 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.13_log
Log file for thread 14 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.14_log
Log file for thread 15 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.15_log
Log file for thread 16 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.16_log
Log file for thread 17 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.17_log
Log file for thread 18 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.18_log
Log file for thread 19 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.19_log
Log file for thread 20 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.20_log
Log file for thread 21 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.21_log
Log file for thread 22 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.22_log
Log file for thread 23 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.23_log
Log file for thread 24 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.24_log
Log file for thread 25 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.25_log
Log file for thread 26 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.26_log
Log file for thread 27 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.27_log
Log file for thread 28 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.28_log
Log file for thread 29 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.29_log
Log file for thread 30 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.30_log
Log file for thread 31 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.31_log
Log file for thread 32 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.32_log
Log file for thread 33 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.33_log
Log file for thread 34 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.34_log
Log file for thread 35 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.35_log
Log file for thread 36 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.36_log
Log file for thread 37 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.37_log
Log file for thread 38 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.38_log
Log file for thread 39 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.39_log
Log file for thread 40 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.40_log
Log file for thread 41 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.41_log
Log file for thread 42 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.42_log
Log file for thread 43 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.43_log
Log file for thread 44 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.44_log
Log file for thread 45 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.45_log
Log file for thread 46 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.46_log
Log file for thread 47 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.47_log
Log file for thread 48 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.48_log
Log file for thread 49 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.49_log
Number of raw hits ...........................: 0                                                  
                                                                                                   
* The HMM source 'Bacteria_71' returned 0 hits. SAD (but it's stil OK).

Files to reproduce

contig database generated from https://www.ncbi.nlm.nih.gov/nuccore/LR794158

dspeth avatar Jun 04 '21 13:06 dspeth

Thank you very much for the detailed report and files/commands to reproduce the issue, Daan!

One thing I'm missing here is the version of the HMMER you have installed in your anvi'o environment :)

meren avatar Jun 04 '21 13:06 meren

HMMER 3.2.1 anvio v7 was installed following the conda/pip instructions on the installation page, roughly 2 weeks ago

dspeth avatar Jun 04 '21 13:06 dspeth

@ivagljiva, I feel like the main branch does not suffer from this anymore, but can you please take a look at this if you find a moment today? Otherwise I will look into it later, so no worries. Thank you!

meren avatar Jun 04 '21 13:06 meren

I will take a look @meren :)

ivagljiva avatar Jun 04 '21 14:06 ivagljiva

@dspeth and @meren, I have tested this in the development branch and it happily appears to be something we already fixed :)

I was able to increase from 1 to 37 threads and still get 63 hits from Bacteria_71 every time:

$ anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -I Bacteria_71 -T 37

WARNING
===============================================
Previous entries for "Bacteria_71" is being removed from "hmm_hits_info,
hmm_hits, hmm_hits_in_splits, genes_in_contigs, gene_functions"

Contigs DB ...................................: A_ciliaticola_LR794158.db
HMM sources ..................................: Bacteria_71
Alphabet/context target found ................: AA:GENE

HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified, https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmp3wzcwjzt/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 37
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpar4jwpi0
Log file for thread 0 ........................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpar4jwpi0/AA_gene_sequences.fa.0_log
[...........]
Log file for thread 36 .......................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpar4jwpi0/AA_gene_sequences.fa.36_log
Number of raw hits in table file .............: 63
Number of weak hits removed by HMMER parser ..: 0
Number of hits in annotation dict  ...........: 63

I didn't test with more threads because I started getting error tracebacks for overloading my poor laptop, but happy to throw this onto the cluster for more extensive testing if you are worried that the trend wouldn't hold for higher numbers.

a potential side issue is that despite it's many warnings anvi-get-sequences-for-hmm-hits does not warn you when one of your contigs dbs does not have any hits to the HMM collection of choice, it juts omits them from the final alignment

I think you are right that we need a warning for this, @dspeth. I will work on adding one :)

ivagljiva avatar Jun 04 '21 16:06 ivagljiva

I have added a warning (commit), and now if you run anvi-get-sequences-for-hmm-hits on a database that does not have hits, you should see something like the following:

WARNING
===============================================
SequencesForHMMHits class here. The current database (at
/Users/iva/Downloads/HMM_BUG/no_hits.db) contains 0 HMM hits, at least within
the HMM sources or splits that were requested. It might not be a problem for
your case, but we just thought you should know, in case it is. So there you have
it.

ivagljiva avatar Jun 04 '21 16:06 ivagljiva

thanks @ivagljiva & @meren!

dspeth avatar Jun 07 '21 16:06 dspeth

Hi @ivagljiva & @meren,

I'm sad to report that I encountered this issue again yesterday, both in the master branch and version 7.1.

I was alerted to it when using anvi-get-sequences-for-hmm-hits with the --max-num-genes-missing-from-bin flag (which is a great addition, by the way :) ) and losing genomes that I know for a fact had the minimum required of genes.

A quick test of re-running hmms with 20 threads in both 7.1 and master gave me 0 hits, and subsequently using 1 thread I got the correct number (60).

dspeth avatar Dec 05 '21 11:12 dspeth

This is craaaaayyyy.

So you have a contigs db that we can use to reproduce this? Can you anvi-split the genome of interest into its own contigs db and send it here so we have a minimal case to test this?

Apologies for this and thanks a lot for letting us know, Daan :(

meren avatar Dec 05 '21 15:12 meren

Hi @meren, I've emailed you the contigs dbs where this issue occurred for me. They are generated from this genome: https://www.ncbi.nlm.nih.gov/nuccore/LR794158 Oddly, this is the genome that @ivagljiva tested the behaviour on earlier, and found no issues

That said, I noticed it in many contigs dbs on my system. Some of those must have been migrated, but not sure that this would cause this problem.

dspeth avatar Dec 05 '21 15:12 dspeth

To test that last statement I just freshly generated the contigs db from the genome fasta (with simplified name), and get the defective behaviour in both the master branch and v7.1

dspeth avatar Dec 05 '21 15:12 dspeth

ah, rereading the issue thread: my HMMER version is 3.3.2 in both conda envs, so a different version than was installed when I encountered this issue first

dspeth avatar Dec 05 '21 15:12 dspeth

We'll check with all versions and report back, Daan. Thank you!

meren avatar Dec 05 '21 16:12 meren

Hey @dspeth, I am unable to reproduce this. First, I generated this test script here (and put it in a file called test_issue_1748.sh):

# check HMMER version
echo -n "HMMER version: "
hmmscan -h | grep HMMER

# get rid of the db if exists
rm -rf A_ciliaticola.db

# generate a fresh contigs db from the FASTA
anvi-gen-contigs-database -f A_ciliaticola_LR794158_sn.fa -o A_ciliaticola.db --quiet
echo -n "Number of genes Prodigal found in the genome: "
sqlite3 A_ciliaticola.db 'select count(*) from genes_in_contigs where source="prodigal";'

# create a copy of the vanilla contigs db as '.orig'
cp A_ciliaticola.db A_ciliaticola.db.orig

# run anvi-run-hmms with a single thread
anvi-run-hmms -c A_ciliaticola.db --quiet
echo -n "Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'

# replace the db with the original one
cp A_ciliaticola.db.orig A_ciliaticola.db

# run anvi-run-hmms with multiple threads
anvi-run-hmms -c A_ciliaticola.db -T 6 --quiet
echo -n "Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'

I run it first on my active branch, which is running HMMER v3.3 from Nov 2019 (both single-threaded and multi-threaded runs found 63 SCGs):

time bash test_issue_1748.sh
HMMER version: # HMMER 3.3 (Nov 2019); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63

real	0m16.363s
user	0m16.181s
sys	0m4.709s

Then I run it using a fresh v7.1 setup, which is running HMMER v3.3.2 from Nov 2020 (and again, both single-threaded and multi-threaded runs found 63 SCGs):

time bash test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63

real	0m16.548s
user	0m18.558s
sys	0m5.308s

The test file is here, please try it on your system and send the output.

meren avatar Dec 07 '21 11:12 meren

Here it is @meren :

v7.1: HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/ Number of genes Prodigal found in the genome: 311 Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63. Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 8

master: HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/ Number of genes Prodigal found in the genome: 311 Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63 Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 8

dspeth avatar Dec 07 '21 20:12 dspeth

what I had noticed (at the time I first reported this), is that the number of genes reported does not seem to be fixed for a given number of threads, when the number of threads gets really high. I'll try to see if I can illustrate what I mean, and post.

my best guess is that some file/variable gets overwritten, giving only the output of the last thread to finish. With a larger number of threads, I think that isn't always the same one.

dspeth avatar Dec 07 '21 20:12 dspeth

@meren Here's an example showing this behaviour, with number of threads set to 30:

(anvi_dev) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 0
(anvi_dev) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 3
(anvi_dev) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 0

As you can see the first and third run yield 0 hits, while the second gives 3.

I don't know whether it's easy to check where/how the output of the multithreading gets merged in anvi-run-hmms but that would be where I think the problem is

dspeth avatar Dec 07 '21 20:12 dspeth

The output from each thread gets put onto a queue after the thread finishes, and then each portion of the output is appended to a single file. This all happens in the HMMER driver class in anvio/drivers/hmmer.py, so that is likely where the issue is coming from.

ivagljiva avatar Dec 07 '21 22:12 ivagljiva

I still have no idea how I can not reproduce it on a Mac system with the same version of HMMER. I'm afraid this is related to architecture depended differences how Python threads are handled. If it is the case, this is very bad news :(

meren avatar Dec 07 '21 22:12 meren

I'm afraid this is related to architecture depended differences how Python threads are handled

Oh no :( The multiprocessing class (which is the one we use) is supposed to run on both Windows and Unix, but there is no mention of Linux :(

ivagljiva avatar Dec 07 '21 22:12 ivagljiva

Crap.

 :: anvi'o dev ::  SSH://MIDWAY  ~ >>> bash test.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 8

meren avatar Dec 07 '21 22:12 meren

That is very not good :(

ivagljiva avatar Dec 07 '21 22:12 ivagljiva

I guess the next thing to do is to find out what causes this on Linux systems :/

Oh no :( The multiprocessing class (which is the one we use) is supposed to run on both Windows and Unix, but there is no mention of Linux :(

Usually something that works with 'Unix systems' include Linux. So the library is certainly for Linux.

I believe the difference will come from how processes are forked and completed in different operating systems will require tiny adjustments. I'm sure people already wrote a ton about this. It is pretty late here. But I will try to get back to this soon.

meren avatar Dec 07 '21 22:12 meren

Any help / input from anyone would be most welcome, of course. I don't want to make it sound like I want to solve this alone :)

meren avatar Dec 07 '21 22:12 meren

aiaiai, sorry to hear this.

I agree that the library must work on linux, otherwise there would be many very noisy errors.

Also, I might do some googling, but unfortunately think we've reached the end of my skillset :(

dspeth avatar Dec 07 '21 22:12 dspeth

We need to compare whether the raw files the program produces when it is run with multiple threads on Mac and Linux systems are identical one not. This is very annoying.

meren avatar Dec 07 '21 22:12 meren

Thank you for bringing this to our attention, @dspeth. I'm sure we will find a way to address this :)

meren avatar Dec 07 '21 22:12 meren

A new development that shows that this is related to how we run different models. it may be a race condition that influence small genomes.

Here is a fully reproducible test script:

#!/bin/bash

set -e

# download the contigs db
rm -rf A_ciliaticola.db.orig
curl -L -o A_ciliaticola.db.orig https://www.dropbox.com/s/z172lpj87fcofrr/A_ciliaticola.db.orig

num_genes=`sqlite3 A_ciliaticola.db 'select count(*) from genes_in_contigs where source="prodigal";'`
echo
echo "A contigs DB w/ $num_genes genes has been downloaded. Running HMMs."
echo

# vanilla contigs db
cp A_ciliaticola.db.orig A_ciliaticola.db

# run all HMMs in multiple threads
anvi-run-hmms -c A_ciliaticola.db -T 6 --quiet
echo -n "Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ALL MODELS) .......: "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'

# vanilla contigs db
cp A_ciliaticola.db.orig A_ciliaticola.db

# run all HMMs in multiple threads
anvi-run-hmms -c A_ciliaticola.db -T 6 -I Bacteria_71 --quiet
echo -n "Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ONLY Bacteria_71) .: "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'

Output on Midway:

$ bash test_models.sh

A contigs DB w/ 311 genes has been downloaded.

Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ALL MODELS) .......: 8
Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ONLY Bacteria_71) .: 63

This is what we need to fix.

meren avatar Dec 08 '21 12:12 meren

hmm, I think it is not limited to small genomes, but rather to genomes with a small number of contigs. I maintain anvio dbs for a large set of genomes (GTDB and a subset of GEM) and noticed issued with a subset of these when doing some phylogenomics with --max-num-genes-missing-from-bin

Off hand, i want to say the number of Bacteria_71 genes was off in about 5-10% of a set of gammaproteobacteria I was making a tree for.

I'll test the number of contigs remark and report back here

dspeth avatar Dec 08 '21 13:12 dspeth

I've manually split the A ciliaticola db in 11 contigs by arbitrarily inserting fasta headers, and ran @meren's test_issue_1748 script on the new database with a varying number of threads. With 10 threads (< # of contigs), there is no problem while a run with 15 threads (> # of contigs) causes issues.

Note the variable number of genes detected in the run with 15 threads. This also indicates how the issue can be easily missed. If the number of threads is marginally larger than the number of contigs in a db, the number of genes detected falls within completely reasonable numbers (as opposed to <10).

10 threads:

(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63

(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63

15 threads:

(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 54
(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh 
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 52

dspeth avatar Dec 08 '21 14:12 dspeth