PPanGGOLiN Manage annotation with joined coordinates

PPanGGOLiN encountered issues when handling joined coordinates present in input annotation files (GFF or GBFF). Such annotations were disregarded when encountered in GBFF files and improperly managed in GFF files.

This PR solves this issue by managing properly these annotations.

Implemented solution

[x] Joined annotation cases have been handled when parsing the input files to retrieve different start and stop coordinates.
[x] The information of the different coordinates is stored in an attribute coordinates of gene objects. This attribute is a list of start and stop tuples.
[x] To keep the information of the joined coordinates, a new table has been added in the hdf5 call joinedCoordinates and stores the multiple start and stop of joined genes.
[x] Some pytest functions have been added to check the correctness of the implementation.
[x] Ensure proper handling of circular RGPs, addressing issues observed in the spot plot (refer to issue #124) and ProkSee output.

Mar 28 '24 17:03 JeanMainguy

Examples of joined annotations

Joined coordinates in annotations can originate from two main sources: frameshift mutations and contig edge overlaps.

The gene is annotated with a frameshift.

Example in GBFF file:

     gene            complement(4142570..4143665)
                     /gene="prfB"
                     /locus_tag="PA3701"
                     /db_xref="GeneID:880384"
     CDS             complement(join(4142570..4143593,4143595..4143665))
                     /gene="prfB"
                     /locus_tag="PA3701"
                     /ribosomal_slippage
                     /codon_start=1
                     /transl_table=11
                     /product="peptide chain release factor 1"
                     /protein_id="YP_003933612.1"
                     /db_xref="GeneID:880384"

Example in GFF file:

Sequence	Source	Type	Start	End	_	Strand	Phase	Attributes
NC_002516.2	RefSeq	gene	4142570	4143665	.	-	.	ID=gene-PA3701
NC_002516.2	RefSeq	CDS	4143595	4143665	.	-	0	ID=cds-YP_003933612.1
NC_002516.2	RefSeq	CDS	4142570	4143593	.	-	1	ID=cds-YP_003933612.1

When a gene is annotated with a frameshift, it is represented by two CDS lines in the GFF file, each corresponding to a different piece of the gene.

The gene is overlapping the edge of a circular contig.

Nothing biological behind that just a bioinformatic issue...

Example in GBFF file:

     gene            join(1041974..1042567,1..1176)
                     /locus_tag="BKB96_RS00005"
                     /old_locus_tag="BKB96_00005"
     CDS             join(1041974..1042567,1..1176)
                     /locus_tag="BKB96_RS00005"
                     /old_locus_tag="BKB96_00005"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:NP_219502.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="WP_100115054.1"

Another case where only one base of the gene is the last base of the circular contig

     gene            join(1038313,1..1016)
                     /locus_tag="CTL2C_RS00005"
     CDS             join(1038313,1..1016)
                     /locus_tag="CTL2C_RS00005"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:YP_001654092.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="porphobilinogen synthase"
                     /protein_id="WP_009873181.1"

Example in GFF file:

Sequence	Source	Type	Start	End	_	Strand	Phase	Attributes
NZ_CP017736.1	RefSeq	gene	1041974	1043743	.	+	.	ID=gene-BKB96_RS00005
NZ_CP017736.1	Protein Homology	CDS	1041974	1043743	.	+	0	ID=cds-WP_100115054.1

In a GFF file, when a gene extends beyond the boundary of a contig, it means that its end position exceeds the size of the contig. For instance, in the provided example, the contig named 'NZ_CP017736.1' has a size of 1042567 base pairs, but the gene's end position is recorded as 1043743, which surpasses the contig's length. This indicates that the gene extends beyond the edge of the contig, effectively overlapping it.

Apr 02 '24 16:04 JeanMainguy

Test on few pangenomes:

species	D.pigrum		P.aeruginosa		E.Coli
	Before	After	Before	After	Before	After
Genes	59999	59999	4 933 371	4 939 509	14 932 814	15 017 229
Genomes	32	32	802	802	3190	3190
Families	4123	4123	32 196	32 279	45 002	45 077
Edges	5742	5742	59 301	60 524	131 142	139 257
Persistent	1436	1436	5 181	5 028	3 137	3 135
Shell	477	477	5 982	6 286	7 250	7 234
Cloud	2210	2210	21 033	20 965	34 676	34 708
RGP	806	806	33 894	34 148	236 867	236 824
Spots	97	97	915	829	2 037	1 721
Modules	122	122	1 420	1422	2 219	2 219

The difference in RGP count comes from a difference in clustering, which affects partitioning.

May 27 '24 18:05 jpjarnoux

The following image shows the difference in RGP on the GCF_016904235.1_ASM1690423v1_CDS_2283 genome with proksee. cmp_proksee_RGP There is a deletion of RGP66 due to the change in partition of the ECK0501 gene family from shell to persistent. RGP66 RGPs 12 and 19 merge into RGP9 due to the change in partition of the ECK1494 gene family from persistent to shell. RGP9

Jun 03 '24 09:06 jpjarnoux

PPanGGOLiN PPanGGOLiN copied to clipboard

Manage annotation with joined coordinates

Implemented solution

Examples of joined annotations

The gene is annotated with a frameshift.

Example in GBFF file:

Example in GFF file:

The gene is overlapping the edge of a circular contig.

Example in GBFF file:

Example in GFF file:

PPanGGOLiN
PPanGGOLiN copied to clipboard