anvio icon indicating copy to clipboard operation
anvio copied to clipboard

[BUG] follow up #2020: list assignment index out of range

Open Hocnonsense opened this issue 6 months ago • 3 comments

Short description of the problem

when import gene from NCBI gff file, anvi'o failed

anvi'o version

$ anvi-self-test --version
Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.15

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Windows11 -- Ubuntu (WSL2) -- conda install "anvio>=8"

Detailed description of the issue

I'd like to convert gff file downloaded from NCBI to anvio. However, a gene at the edge of plasmid results in exception:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM2016221v1
#!genome-build-accession NCBI_Assembly:GCF_020162215.1
#!annotation-date 11/06/2024 09:59:28
#!annotation-source NCBI RefSeq GCF_020162215.1-RS_2024_11_06
##sequence-region NZ_CP083939.1 1 138008
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=363952
NZ_CP083939.1	RefSeq	region	1	138008	.	+	.	ID=NZ_CP083939.1:1..138008;Dbxref=taxon:363952;Is_circular=true;Name=unnamed1;collection-date=2019;country=USA:California;gbkey=Src;genome=plasmid;isolation-source=American Alligator's water tank;mol_type=genomic DNA;plasmid-name=unnamed1;strain=CSUSB4
NZ_CP083939.1	RefSeq	gene	137518	139500	.	+	.	ID=gene-LCH15_RS25130;Name=LCH15_RS25130;gbkey=Gene;gene_biotype=protein_coding;locus_tag=LCH15_RS25130;old_locus_tag=LCH15_25130
NZ_CP083939.1	GeneMarkS-2+	CDS	137518	139500	.	+	0	ID=cds-WP_224651346.1;Parent=gene-LCH15_RS25130;Dbxref=GenBank:WP_224651346.1;Name=WP_224651346.1;gbkey=CDS;inference=COORDINATES: ab initio prediction:GeneMarkS-2+;locus_tag=LCH15_RS25130;product=GGDEF domain-containing protein;protein_id=WP_224651346.1;transl_table=11

Here, NZ_CP083939.1 is a circular plasmid, with a gene starts from 137518 to 139500 ((139500-137518+1)/3=660 amino acids + 1 stop codon). I translated the gene successfully by adding the copy of reference sequence at the end of it self. However, anvi'o seems don't like it.

Files / commands to reproduce the issue

I generated the input.fa and input.tsv from the genome annotation: fasta: NZ_CP083939.1 tsv:

gene_callers_id	contig	start	stop	direction	partial	call_type	source	version	aa_sequence
5213	NZ_CP083939.1	137517	139500	f	0	1	GeneMarkS-2+	v1.0	MELLAKIKEVGSLNGFLTAWTPMAVLLIGLGFNVVLYQFSVESVRERQRVYFDFRAREAVERIESRMATYQQVLRGAAGNFDSHEQVSREEFRNYADRQALDKHFPGIQGIGFAQAIAPNDLQTHITGVRAEGFPSYSVQPSGQRDLYSSIVYLEPFSGRNLRAFGFDMYSEPVRRTAMQRSVDTGEMALSGKVRLQQESGIKEQSGFLIYQAIYFRDRPITNIEERRAQLRGWVYAPFRMNDFLQGLFGEQGHDLIIDIFDGEQMLAKAQTHYGSLERPDIQQGLESVRTLHMMGQTWTIRVYASTSLLQRVDHRLPIFAAVAGVLLSSLVAGLVFLLVSAKSRAESAARAMTEDLSLEKTRLNAILEGTRVGTWEWNVQTGQTVFNEQWAAIVGYQLAELKPLSIQTWIGMVHPEDLQKSDLLIKQHLAGETPFYECEARMRHKQGHWVWVLDRGKIGKWSEDGKPLIMYGTHQDITREKQKMDTYHHGAHHDVLTDLPNRILLSDRLSQSLALAEREQTHVALLFMDLDGFKLINDNHGHDAGDVVLQTVARRIQQCIRASDTLARVGGDEFVALLQDAGDEQEALKMANLFNSEVRRPIPLPDGGEGHVSLSIGIAIYPQHGITAELLSEHADQAMYQVKKSSKNAALVYRSGGAA

And this is how I run:

$ anvi-gen-contigs-database             -f genes.fa             -n GCF_020162215.1.anvio             -o genes.db             -T 1 --split-length 0 --force-overwrite             --external-gene-calls genes.tsv
Input FASTA file .............................: ./genes.fa
Name .........................................: GCF_020162215.1.anvio
Description ..................................: No description is given
External gene calls ..........................: 1 gene calls recovered and will be processed.                                               

WARNING
===============================================
Anvi'o found amino acid sequences in your external gene calls file that match to
1 of 1 gene in it and will use these amino acid sequences for everything.


EXTERNAL GENE CALLS PARSER REPORT
===============================================
Num gene calls in file .......................: 1
Non-coding gene calls ........................: 0
Partial gene calls ...........................: 0
Num amino acid sequences provided ............: 1
  - For complete gene calls ..................: 1
  - For partial gene calls ...................: 0
Frames predicted .............................: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0
Gene calls marked as NONCODING ...............: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0
Gene calls with internal stops ...............: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0

                                                                                                                                            
CONTIGS DB CREATE REPORT
===============================================
Split Length .................................: 9,223,372,036,854,775,807
K-mer size ...................................: 4
Skip gene calling? ...........................: False
External gene calls provided? ................: True
External gene calls file have AA sequences? ..: True
Proper frames will be predicted? .............: True
Ignoring internal stop codons? ...............: False
Splitting pays attention to gene calls? ......: True
[04 Jul 25 14:58:01 The South Loop] Contig "1" has 1 genes, and 138008 nts. Now computing: auxiliary ...                          ETA: NoneTraceback (most recent call last):
  File "$CONDA_PREFIX/lib/python3.10/site-packages/anvio/dbops.py", line 4671, in compress_nt_position_info
    nt_position_info_list[nt_position + 2] = 1
IndexError: list assignment index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "$CONDA_PREFIX/bin/anvi-gen-contigs-database", line 63, in <module>
    main(args)
  File "$CONDA_PREFIX/lib/python3.10/site-packages/anvio/terminal.py", line 915, in wrapper
    program_method(*args, **kwargs)
  File "$CONDA_PREFIX/bin/anvi-gen-contigs-database", line 32, in main
    a.create(args)
  File "$CONDA_PREFIX/lib/python3.10/site-packages/anvio/dbops.py", line 4529, in create
    nt_position_info_list = self.compress_nt_position_info(contig_name, contig_length, genes_in_contig, genes_in_contigs_dict)
  File "$CONDA_PREFIX/lib/python3.10/site-packages/anvio/dbops.py", line 4685, in compress_nt_position_info
    nt_position_info_list[nt_position] = 8
IndexError: list assignment index out of range

Extra info

By the way, I've seen the warning many times, but for this case, the problem still exists:

SNAFU WITH CONTIG `NZ_JBDUFS010000666.1`
===============================================
There were some problmes while anvi'o was trying to identify which nucleotides
occur in which codon positions. These problems occurred in the following gene
calls: 5413. Please read this message for more information:
https://github.com/merenlab/anvio/issues/2020#issuecomment-1341028673.

By the way 2, is there any official entry than I can convert the popular annotation file (gff, gbk) into anvio.db quickly?

Thakks! Aoran Hu

Hocnonsense avatar Jul 04 '25 07:07 Hocnonsense

Dear anvi'o developers,

I have been reviewing the compress_nt_position_info function in https://github.com/merenlab/anvio/blob/f37c0786002548374e76397067e6553134d4cb95/anvio/dbops.py#L4856C1-L4897) and its relation to issues #1661 and #2020. I have some concerns and suggestions regarding the handling of gene calls at contig boundaries and the treatment of eukaryotic genes.

Regarding Issue #1661

I am unclear on how commit 132d98d resolves issue #1661. The issue seems to involve genes with a stop position equal to the contig length (stop == contig_length). Skipping such genes (as done with if stop == contig_length and stop - start != contig_length: continue) risks excluding valid genes that end at the contig boundary, especially if upstream bases are present. Could you clarify why these genes are skipped rather than processed normally, given that their coordinates are within the contig's range?

Regarding Issue #2020

I respectfully disagree with the comment in #2020 suggesting that eukaryotic gene lengths often satisfy gene_length % 3 > 0 due to introns. In standard genomic annotation, gene length typically refers to the coding sequence (exons only, after splicing), not including introns. Treating eukaryotic genes as prokaryotic open reading frames (ORFs) may lead to errors, as their structures differ significantly. The current code does not appear to account for exon-intron boundaries. Would it be feasible to modify the data structure to record exon coordinates and concatenate them for accurate codon position assignment? Additionally, how does anvi'o handle overlapping genes, which are common in prokaryotes?

Furthermore, the issue description in #2020 mentions two scenarios: (1) gene_length % 3 != 0, and (2) genes at contig boundaries causing IndexError due to floating-point rounding. For external gene calls, start and stop positions are typically integers, so rounding issues should not occur. If a gene predictor assigns coordinates beyond the contig length, this indicates an error in the input data, which should be filtered before processing in anvi'o rather than handled with a try-except block. Alternatively, out-of-range gene calls may suggest a circular contig, where the gene wraps around from the end to the start (I could not find a mechanism in the contig database to indicate whether a contig is circular). Could this be considered in future updates?

Proposed Solution

To address these boundary issues more robustly and avoid reliance on try-except, I suggest two alternative approaches for the codon position assignment loop:

  1. Trim Gene Coordinates: Limit the gene's stop position to the contig length to prevent index errors.
            if gene_call['direction'] == 'f':
                for nt_position in range(start, min(stop, contig_length), 3):
                    nt_position_info_list[nt_position] = 4
                    nt_position_info_list[nt_position + 1] = 2
                    nt_position_info_list[nt_position + 2] = 1
            elif gene_call['direction'] == 'r':
                for nt_position in range(min(stop - 1, contig_length), start - 1, -3):
                    nt_position_info_list[nt_position] = 4
                    nt_position_info_list[nt_position - 1] = 2
                    nt_position_info_list[nt_position - 2] = 1
  1. Support Circular Contigs: Use modulo to wrap coordinates around the contig start, accommodating circular contigs, I like this but it can be bad if the other part of anvi'o do not support this.
            if gene_call['direction'] == 'f':
                for nt_position in range(start, stop, 3):
                    nt_position_info_list[(nt_position) % contig_length] = 4
                    nt_position_info_list[(nt_position + 1) % contig_length] = 2
                    nt_position_info_list[(nt_position + 2) % contig_length] = 1
            elif gene_call['direction'] == 'r':
                for nt_position in range(stop - 1, start - 1, -3):
                    nt_position_info_list[(nt_position) % contig_length] = 4
                    nt_position_info_list[(nt_position - 1) % contig_length] = 2
                    nt_position_info_list[(nt_position - 2) % contig_length] = 1

These approaches ensure robustness by either trimming out-of-range coordinates or supporting circular contigs explicitly, reducing the need for error-catching. Could the team consider these solutions or share insights on why the current approach was chosen?

Thank you for your time and the excellent work on anvi'o. I look forward to your feedback and any clarification on these points.

Best regards, Aoran Hu

Hocnonsense avatar Jul 04 '25 08:07 Hocnonsense

Dear @Hocnonsense,

Thank you very much for this detailed report and discussion. I am very interested in diving into this but I want to make sure you experience these issues also with the anvi'o development branch. We're a bit late in our schedule to make a release unfortunately, so a lot of features and fixes are stuck in anvio-dev and not in v8 :(

Best wishes,

meren avatar Jul 04 '25 12:07 meren

Sure. although I havn't test it using anvio-dev, but I assume that the master branch is the newst version. After checking anvio/dbops.py, I think this is the same as anvio-v8

https://github.com/merenlab/anvio/blob/9f079b3952bf8add49bbf6312bf1b5fca5c255b5/anvio/dbops.py#L5136-L5256

For testing, here is the demo I used, as mentioned above:

contig: download NZ_CP083939.1 and save as genes.fa tsv:

gene_callers_id	contig	start	stop	direction	partial	call_type	source	version	aa_sequence
5213	NZ_CP083939.1	137517	139500	f	0	1	GeneMarkS-2+	v1.0	MELLAKIKEVGSLNGFLTAWTPMAVLLIGLGFNVVLYQFSVESVRERQRVYFDFRAREAVERIESRMATYQQVLRGAAGNFDSHEQVSREEFRNYADRQALDKHFPGIQGIGFAQAIAPNDLQTHITGVRAEGFPSYSVQPSGQRDLYSSIVYLEPFSGRNLRAFGFDMYSEPVRRTAMQRSVDTGEMALSGKVRLQQESGIKEQSGFLIYQAIYFRDRPITNIEERRAQLRGWVYAPFRMNDFLQGLFGEQGHDLIIDIFDGEQMLAKAQTHYGSLERPDIQQGLESVRTLHMMGQTWTIRVYASTSLLQRVDHRLPIFAAVAGVLLSSLVAGLVFLLVSAKSRAESAARAMTEDLSLEKTRLNAILEGTRVGTWEWNVQTGQTVFNEQWAAIVGYQLAELKPLSIQTWIGMVHPEDLQKSDLLIKQHLAGETPFYECEARMRHKQGHWVWVLDRGKIGKWSEDGKPLIMYGTHQDITREKQKMDTYHHGAHHDVLTDLPNRILLSDRLSQSLALAEREQTHVALLFMDLDGFKLINDNHGHDAGDVVLQTVARRIQQCIRASDTLARVGGDEFVALLQDAGDEQEALKMANLFNSEVRRPIPLPDGGEGHVSLSIGIAIYPQHGITAELLSEHADQAMYQVKKSSKNAALVYRSGGAA

command:

anvi-gen-contigs-database -f genes.fa -n GCF_020162215.1.anvio -o genes.db -T 1 --split-length 0 --force-overwrite --external-gene-calls genes.tsv

Hoping the newset version will come soon!

Hocnonsense avatar Jul 04 '25 12:07 Hocnonsense