[BUG] anvi-run-hmms silently fails when multithreaded
Short description of the problem
anvi-run-hmms silently fails when multithreaded, and run on a single (or low number of) contigs
anvi'o version
Anvi'o .......................................: hope (v7)
Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1
System info
Linux version 4.9.0-13-amd64 ([email protected]) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.228-1 (2020-07-05)
Detailed description of the issue
Hello anvio team, i hope this bug is something you have already encountered and fixed in the master branch, since it concerns HMMs, and I did notice you have spent quite a bit of energy on handling them recently. However, I could not find an issue specifically referencing this problem, so I'll open a new one.
I was building a phylogenomic tree using the steps described in the anvio phylogenomic workflow. My target was a recently published gammaproteobacterial endosymbiont ( https://www.ncbi.nlm.nih.gov/nuccore/LR794158), and so I was curious what the alignment for this specific organism looked like. However, a quick grep of the concatenated protein fastafile did not yield anything, which was somewhat surprising. Some other contigs databases, which were in multiple contigs, seemed to be fine.
(Note: a potential side issue is that despite it's many warnings anvi-get-sequences-for-hmm-hits does not warn you when one of your contigs dbs does not have any hits to the HMM collection of choice, it juts omits them from the final alignment)
some sleuthing ensued, and I realized I had used anvi-run-hmms with the -T 50 flag. (Did I really use 50 cores to run HMMs on a 300kb genome you ask? Yes I did, and so I maybe deserved the ensuing mess for being wasteful). Taking the contigs db and running anvi-run-hmms yielded a decreasing number of hits to the Bacteria_71 HMM collection, with increasing number of cores, hitting 0 somewhere between 40-50, as detailed by the pasted output of the number of hits to the collection below.
Since this seemed to behave fine for genomes in more pieces (no systematic testing done), I think it might have something to do with the partitioning of a single sequence during the multithreading.
Hope this helps! Daan
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified, https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpd645k8uj/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp2lasbim2
Log file for thread 0 ........................: /tmp/tmp2lasbim2/AA_gene_sequences.fa.0_log
Number of raw hits ...........................: 63
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 63
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 2
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpph05hta4/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp_louykze
Log file for thread 0 ........................: /tmp/tmp_louykze/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp_louykze/AA_gene_sequences.fa.1_log
Number of raw hits ...........................: 21
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 21
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 3
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp7_5b1ouh/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp6lvjpvem
Log file for thread 0 ........................: /tmp/tmp6lvjpvem/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp6lvjpvem/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmp6lvjpvem/AA_gene_sequences.fa.2_log
Number of raw hits ...........................: 17
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 17
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 4
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp_a8836k8/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmpq0xwm06_
Log file for thread 0 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmpq0xwm06_/AA_gene_sequences.fa.3_log
Number of raw hits ...........................: 10
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 10
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 5
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpknk0_xq0/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmplb4oin0s
Log file for thread 0 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmplb4oin0s/AA_gene_sequences.fa.4_log
Number of raw hits ...........................: 8
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 8
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 10
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp0ue00x54/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmpw0a4b9ro
Log file for thread 0 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmpw0a4b9ro/AA_gene_sequences.fa.9_log
Number of raw hits ...........................: 8
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 8
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 20
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmptsqxt8yh/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmpbihm5bh0
Log file for thread 0 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.9_log
Log file for thread 10 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.10_log
Log file for thread 11 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.11_log
Log file for thread 12 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.12_log
Log file for thread 13 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.13_log
Log file for thread 14 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.14_log
Log file for thread 15 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.15_log
Log file for thread 16 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.16_log
Log file for thread 17 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.17_log
Log file for thread 18 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.18_log
Log file for thread 19 .......................: /tmp/tmpbihm5bh0/AA_gene_sequences.fa.19_log
Number of raw hits ...........................: 2
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 2
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 40
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmpf0j2rpes/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp9s9t5sql
Log file for thread 0 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.9_log
Log file for thread 10 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.10_log
Log file for thread 11 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.11_log
Log file for thread 12 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.12_log
Log file for thread 13 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.13_log
Log file for thread 14 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.14_log
Log file for thread 15 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.15_log
Log file for thread 16 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.16_log
Log file for thread 17 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.17_log
Log file for thread 18 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.18_log
Log file for thread 19 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.19_log
Log file for thread 20 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.20_log
Log file for thread 21 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.21_log
Log file for thread 22 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.22_log
Log file for thread 23 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.23_log
Log file for thread 24 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.24_log
Log file for thread 25 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.25_log
Log file for thread 26 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.26_log
Log file for thread 27 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.27_log
Log file for thread 28 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.28_log
Log file for thread 29 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.29_log
Log file for thread 30 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.30_log
Log file for thread 31 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.31_log
Log file for thread 32 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.32_log
Log file for thread 33 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.33_log
Log file for thread 34 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.34_log
Log file for thread 35 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.35_log
Log file for thread 36 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.36_log
Log file for thread 37 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.37_log
Log file for thread 38 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.38_log
Log file for thread 39 .......................: /tmp/tmp9s9t5sql/AA_gene_sequences.fa.39_log
Number of raw hits ...........................: 1
Number of weak hits removed ..................: 0
Number of hits in annotation dict ...........: 1
anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -T 50
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified,
https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /tmp/tmp_muvxx5r/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /tmp/tmp33y1315u
Log file for thread 0 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.0_log
Log file for thread 1 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.1_log
Log file for thread 2 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.2_log
Log file for thread 3 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.3_log
Log file for thread 4 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.4_log
Log file for thread 5 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.5_log
Log file for thread 6 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.6_log
Log file for thread 7 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.7_log
Log file for thread 8 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.8_log
Log file for thread 9 ........................: /tmp/tmp33y1315u/AA_gene_sequences.fa.9_log
Log file for thread 10 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.10_log
Log file for thread 11 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.11_log
Log file for thread 12 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.12_log
Log file for thread 13 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.13_log
Log file for thread 14 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.14_log
Log file for thread 15 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.15_log
Log file for thread 16 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.16_log
Log file for thread 17 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.17_log
Log file for thread 18 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.18_log
Log file for thread 19 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.19_log
Log file for thread 20 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.20_log
Log file for thread 21 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.21_log
Log file for thread 22 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.22_log
Log file for thread 23 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.23_log
Log file for thread 24 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.24_log
Log file for thread 25 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.25_log
Log file for thread 26 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.26_log
Log file for thread 27 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.27_log
Log file for thread 28 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.28_log
Log file for thread 29 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.29_log
Log file for thread 30 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.30_log
Log file for thread 31 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.31_log
Log file for thread 32 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.32_log
Log file for thread 33 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.33_log
Log file for thread 34 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.34_log
Log file for thread 35 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.35_log
Log file for thread 36 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.36_log
Log file for thread 37 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.37_log
Log file for thread 38 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.38_log
Log file for thread 39 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.39_log
Log file for thread 40 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.40_log
Log file for thread 41 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.41_log
Log file for thread 42 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.42_log
Log file for thread 43 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.43_log
Log file for thread 44 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.44_log
Log file for thread 45 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.45_log
Log file for thread 46 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.46_log
Log file for thread 47 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.47_log
Log file for thread 48 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.48_log
Log file for thread 49 .......................: /tmp/tmp33y1315u/AA_gene_sequences.fa.49_log
Number of raw hits ...........................: 0
* The HMM source 'Bacteria_71' returned 0 hits. SAD (but it's stil OK).
Files to reproduce
contig database generated from https://www.ncbi.nlm.nih.gov/nuccore/LR794158
Thank you very much for the detailed report and files/commands to reproduce the issue, Daan!
One thing I'm missing here is the version of the HMMER you have installed in your anvi'o environment :)
HMMER 3.2.1 anvio v7 was installed following the conda/pip instructions on the installation page, roughly 2 weeks ago
@ivagljiva, I feel like the main branch does not suffer from this anymore, but can you please take a look at this if you find a moment today? Otherwise I will look into it later, so no worries. Thank you!
I will take a look @meren :)
@dspeth and @meren, I have tested this in the development branch and it happily appears to be something we already fixed :)
I was able to increase from 1 to 37 threads and still get 63 hits from Bacteria_71 every time:
$ anvi-run-hmms --just-do-it -c A_ciliaticola_LR794158.db -I Bacteria_71 -T 37
WARNING
===============================================
Previous entries for "Bacteria_71" is being removed from "hmm_hits_info,
hmm_hits, hmm_hits_in_splits, genes_in_contigs, gene_functions"
Contigs DB ...................................: A_ciliaticola_LR794158.db
HMM sources ..................................: Bacteria_71
Alphabet/context target found ................: AA:GENE
HMM Profiling for Bacteria_71
===============================================
Reference ....................................: Lee modified, https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmp3wzcwjzt/Bacteria_71.hmm
Number of genes in HMM model .................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 37
HMMer program used for search ................: hmmscan
Temporary work dir ...........................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpar4jwpi0
Log file for thread 0 ........................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpar4jwpi0/AA_gene_sequences.fa.0_log
[...........]
Log file for thread 36 .......................: /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpar4jwpi0/AA_gene_sequences.fa.36_log
Number of raw hits in table file .............: 63
Number of weak hits removed by HMMER parser ..: 0
Number of hits in annotation dict ...........: 63
I didn't test with more threads because I started getting error tracebacks for overloading my poor laptop, but happy to throw this onto the cluster for more extensive testing if you are worried that the trend wouldn't hold for higher numbers.
a potential side issue is that despite it's many warnings anvi-get-sequences-for-hmm-hits does not warn you when one of your contigs dbs does not have any hits to the HMM collection of choice, it juts omits them from the final alignment
I think you are right that we need a warning for this, @dspeth. I will work on adding one :)
I have added a warning (commit), and now if you run anvi-get-sequences-for-hmm-hits on a database that does not have hits, you should see something like the following:
WARNING
===============================================
SequencesForHMMHits class here. The current database (at
/Users/iva/Downloads/HMM_BUG/no_hits.db) contains 0 HMM hits, at least within
the HMM sources or splits that were requested. It might not be a problem for
your case, but we just thought you should know, in case it is. So there you have
it.
thanks @ivagljiva & @meren!
Hi @ivagljiva & @meren,
I'm sad to report that I encountered this issue again yesterday, both in the master branch and version 7.1.
I was alerted to it when using anvi-get-sequences-for-hmm-hits with the --max-num-genes-missing-from-bin flag (which is a great addition, by the way :) ) and losing genomes that I know for a fact had the minimum required of genes.
A quick test of re-running hmms with 20 threads in both 7.1 and master gave me 0 hits, and subsequently using 1 thread I got the correct number (60).
This is craaaaayyyy.
So you have a contigs db that we can use to reproduce this? Can you anvi-split the genome of interest into its own contigs db and send it here so we have a minimal case to test this?
Apologies for this and thanks a lot for letting us know, Daan :(
Hi @meren, I've emailed you the contigs dbs where this issue occurred for me. They are generated from this genome: https://www.ncbi.nlm.nih.gov/nuccore/LR794158 Oddly, this is the genome that @ivagljiva tested the behaviour on earlier, and found no issues
That said, I noticed it in many contigs dbs on my system. Some of those must have been migrated, but not sure that this would cause this problem.
To test that last statement I just freshly generated the contigs db from the genome fasta (with simplified name), and get the defective behaviour in both the master branch and v7.1
ah, rereading the issue thread: my HMMER version is 3.3.2 in both conda envs, so a different version than was installed when I encountered this issue first
We'll check with all versions and report back, Daan. Thank you!
Hey @dspeth, I am unable to reproduce this. First, I generated this test script here (and put it in a file called test_issue_1748.sh):
# check HMMER version
echo -n "HMMER version: "
hmmscan -h | grep HMMER
# get rid of the db if exists
rm -rf A_ciliaticola.db
# generate a fresh contigs db from the FASTA
anvi-gen-contigs-database -f A_ciliaticola_LR794158_sn.fa -o A_ciliaticola.db --quiet
echo -n "Number of genes Prodigal found in the genome: "
sqlite3 A_ciliaticola.db 'select count(*) from genes_in_contigs where source="prodigal";'
# create a copy of the vanilla contigs db as '.orig'
cp A_ciliaticola.db A_ciliaticola.db.orig
# run anvi-run-hmms with a single thread
anvi-run-hmms -c A_ciliaticola.db --quiet
echo -n "Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'
# replace the db with the original one
cp A_ciliaticola.db.orig A_ciliaticola.db
# run anvi-run-hmms with multiple threads
anvi-run-hmms -c A_ciliaticola.db -T 6 --quiet
echo -n "Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'
I run it first on my active branch, which is running HMMER v3.3 from Nov 2019 (both single-threaded and multi-threaded runs found 63 SCGs):
time bash test_issue_1748.sh
HMMER version: # HMMER 3.3 (Nov 2019); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63
real 0m16.363s
user 0m16.181s
sys 0m4.709s
Then I run it using a fresh v7.1 setup, which is running HMMER v3.3.2 from Nov 2020 (and again, both single-threaded and multi-threaded runs found 63 SCGs):
time bash test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63
real 0m16.548s
user 0m18.558s
sys 0m5.308s
The test file is here, please try it on your system and send the output.
Here it is @meren :
v7.1: HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/ Number of genes Prodigal found in the genome: 311 Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63. Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 8
master: HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/ Number of genes Prodigal found in the genome: 311 Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63 Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 8
what I had noticed (at the time I first reported this), is that the number of genes reported does not seem to be fixed for a given number of threads, when the number of threads gets really high. I'll try to see if I can illustrate what I mean, and post.
my best guess is that some file/variable gets overwritten, giving only the output of the last thread to finish. With a larger number of threads, I think that isn't always the same one.
@meren Here's an example showing this behaviour, with number of threads set to 30:
(anvi_dev) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 0
(anvi_dev) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 3
(anvi_dev) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 0
As you can see the first and third run yield 0 hits, while the second gives 3.
I don't know whether it's easy to check where/how the output of the multithreading gets merged in anvi-run-hmms but that would be where I think the problem is
The output from each thread gets put onto a queue after the thread finishes, and then each portion of the output is appended to a single file. This all happens in the HMMER driver class in anvio/drivers/hmmer.py, so that is likely where the issue is coming from.
I still have no idea how I can not reproduce it on a Mac system with the same version of HMMER. I'm afraid this is related to architecture depended differences how Python threads are handled. If it is the case, this is very bad news :(
I'm afraid this is related to architecture depended differences how Python threads are handled
Oh no :( The multiprocessing class (which is the one we use) is supposed to run on both Windows and Unix, but there is no mention of Linux :(
Crap.
:: anvi'o dev :: SSH://MIDWAY ~ >>> bash test.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 311
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 8
That is very not good :(
I guess the next thing to do is to find out what causes this on Linux systems :/
Oh no :( The multiprocessing class (which is the one we use) is supposed to run on both Windows and Unix, but there is no mention of Linux :(
Usually something that works with 'Unix systems' include Linux. So the library is certainly for Linux.
I believe the difference will come from how processes are forked and completed in different operating systems will require tiny adjustments. I'm sure people already wrote a ton about this. It is pretty late here. But I will try to get back to this soon.
Any help / input from anyone would be most welcome, of course. I don't want to make it sound like I want to solve this alone :)
aiaiai, sorry to hear this.
I agree that the library must work on linux, otherwise there would be many very noisy errors.
Also, I might do some googling, but unfortunately think we've reached the end of my skillset :(
We need to compare whether the raw files the program produces when it is run with multiple threads on Mac and Linux systems are identical one not. This is very annoying.
Thank you for bringing this to our attention, @dspeth. I'm sure we will find a way to address this :)
A new development that shows that this is related to how we run different models. it may be a race condition that influence small genomes.
Here is a fully reproducible test script:
#!/bin/bash
set -e
# download the contigs db
rm -rf A_ciliaticola.db.orig
curl -L -o A_ciliaticola.db.orig https://www.dropbox.com/s/z172lpj87fcofrr/A_ciliaticola.db.orig
num_genes=`sqlite3 A_ciliaticola.db 'select count(*) from genes_in_contigs where source="prodigal";'`
echo
echo "A contigs DB w/ $num_genes genes has been downloaded. Running HMMs."
echo
# vanilla contigs db
cp A_ciliaticola.db.orig A_ciliaticola.db
# run all HMMs in multiple threads
anvi-run-hmms -c A_ciliaticola.db -T 6 --quiet
echo -n "Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ALL MODELS) .......: "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'
# vanilla contigs db
cp A_ciliaticola.db.orig A_ciliaticola.db
# run all HMMs in multiple threads
anvi-run-hmms -c A_ciliaticola.db -T 6 -I Bacteria_71 --quiet
echo -n "Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ONLY Bacteria_71) .: "
sqlite3 A_ciliaticola.db 'select count(*) from hmm_hits where source="Bacteria_71";'
Output on Midway:
$ bash test_models.sh
A contigs DB w/ 311 genes has been downloaded.
Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ALL MODELS) .......: 8
Num Bacteria_71 HMM hits found in contigs-db (MULTI-THREADED, ONLY Bacteria_71) .: 63
This is what we need to fix.
hmm, I think it is not limited to small genomes, but rather to genomes with a small number of contigs. I maintain anvio dbs for a large set of genomes (GTDB and a subset of GEM) and noticed issued with a subset of these when doing some phylogenomics with --max-num-genes-missing-from-bin
Off hand, i want to say the number of Bacteria_71 genes was off in about 5-10% of a set of gammaproteobacteria I was making a tree for.
I'll test the number of contigs remark and report back here
I've manually split the A ciliaticola db in 11 contigs by arbitrarily inserting fasta headers, and ran @meren's test_issue_1748 script on the new database with a varying number of threads. With 10 threads (< # of contigs), there is no problem while a run with 15 threads (> # of contigs) causes issues.
Note the variable number of genes detected in the run with 15 threads. This also indicates how the issue can be easily missed. If the number of threads is marginally larger than the number of contigs in a db, the number of genes detected falls within completely reasonable numbers (as opposed to <10).
10 threads:
(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63
(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 63
15 threads:
(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 54
(anvio-7.1) dspeth@bigmem-7:/scratch/dspeth/workdir_temp/anvi_hmm_test$ ./test_issue_1748.sh
HMMER version: # HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Number of genes Prodigal found in the genome: 318
Number of Bacteria_71 HMM hits found in genome (SINGLE-THREADED HMM run): 63
Number of Bacteria_71 HMM hits found in genome ( MULTI-THREADED HMM run): 52