PPanGGOLiN
PPanGGOLiN copied to clipboard
Manage annotation with joined coordinates
PPanGGOLiN encountered issues when handling joined coordinates present in input annotation files (GFF or GBFF). Such annotations were disregarded when encountered in GBFF files and improperly managed in GFF files.
This PR solves this issue by managing properly these annotations.
Implemented solution
- [x] Joined annotation cases have been handled when parsing the input files to retrieve different start and stop coordinates.
- [x] The information of the different coordinates is stored in an attribute coordinates of gene objects. This attribute is a list of start and stop tuples.
- [x] To keep the information of the joined coordinates, a new table has been added in the hdf5 call
joinedCoordinates
and stores the multiple start and stop of joined genes. - [x] Some pytest functions have been added to check the correctness of the implementation.
- [x] Ensure proper handling of circular RGPs, addressing issues observed in the spot plot (refer to issue #124) and ProkSee output.
Examples of joined annotations
Joined coordinates in annotations can originate from two main sources: frameshift mutations and contig edge overlaps.
The gene is annotated with a frameshift.
Example in GBFF file:
gene complement(4142570..4143665)
/gene="prfB"
/locus_tag="PA3701"
/db_xref="GeneID:880384"
CDS complement(join(4142570..4143593,4143595..4143665))
/gene="prfB"
/locus_tag="PA3701"
/ribosomal_slippage
/codon_start=1
/transl_table=11
/product="peptide chain release factor 1"
/protein_id="YP_003933612.1"
/db_xref="GeneID:880384"
Example in GFF file:
Sequence | Source | Type | Start | End | _ | Strand | Phase | Attributes |
---|---|---|---|---|---|---|---|---|
NC_002516.2 | RefSeq | gene | 4142570 | 4143665 | . | - | . | ID=gene-PA3701 |
NC_002516.2 | RefSeq | CDS | 4143595 | 4143665 | . | - | 0 | ID=cds-YP_003933612.1 |
NC_002516.2 | RefSeq | CDS | 4142570 | 4143593 | . | - | 1 | ID=cds-YP_003933612.1 |
When a gene is annotated with a frameshift, it is represented by two CDS lines in the GFF file, each corresponding to a different piece of the gene.
The gene is overlapping the edge of a circular contig.
Nothing biological behind that just a bioinformatic issue...
Example in GBFF file:
gene join(1041974..1042567,1..1176)
/locus_tag="BKB96_RS00005"
/old_locus_tag="BKB96_00005"
CDS join(1041974..1042567,1..1176)
/locus_tag="BKB96_RS00005"
/old_locus_tag="BKB96_00005"
/inference="COORDINATES: similar to AA
sequence:RefSeq:NP_219502.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="WP_100115054.1"
Another case where only one base of the gene is the last base of the circular contig
gene join(1038313,1..1016)
/locus_tag="CTL2C_RS00005"
CDS join(1038313,1..1016)
/locus_tag="CTL2C_RS00005"
/inference="COORDINATES: similar to AA
sequence:RefSeq:YP_001654092.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="porphobilinogen synthase"
/protein_id="WP_009873181.1"
Example in GFF file:
Sequence | Source | Type | Start | End | _ | Strand | Phase | Attributes |
---|---|---|---|---|---|---|---|---|
NZ_CP017736.1 | RefSeq | gene | 1041974 | 1043743 | . | + | . | ID=gene-BKB96_RS00005 |
NZ_CP017736.1 | Protein Homology | CDS | 1041974 | 1043743 | . | + | 0 | ID=cds-WP_100115054.1 |
In a GFF file, when a gene extends beyond the boundary of a contig, it means that its end position exceeds the size of the contig. For instance, in the provided example, the contig named 'NZ_CP017736.1' has a size of 1042567 base pairs, but the gene's end position is recorded as 1043743, which surpasses the contig's length. This indicates that the gene extends beyond the edge of the contig, effectively overlapping it.
Test on few pangenomes:
species | D.pigrum | P.aeruginosa | E.Coli | |||
---|---|---|---|---|---|---|
Before | After | Before | After | Before | After | |
Genes | 59999 | 59999 | 4 933 371 | 4 939 509 | 14 932 814 | 15 017 229 |
Genomes | 32 | 32 | 802 | 802 | 3190 | 3190 |
Families | 4123 | 4123 | 32 196 | 32 279 | 45 002 | 45 077 |
Edges | 5742 | 5742 | 59 301 | 60 524 | 131 142 | 139 257 |
Persistent | 1436 | 1436 | 5 181 | 5 028 | 3 137 | 3 135 |
Shell | 477 | 477 | 5 982 | 6 286 | 7 250 | 7 234 |
Cloud | 2210 | 2210 | 21 033 | 20 965 | 34 676 | 34 708 |
RGP | 806 | 806 | 33 894 | 34 148 | 236 867 | 236 824 |
Spots | 97 | 97 | 915 | 829 | 2 037 | 1 721 |
Modules | 122 | 122 | 1 420 | 1422 | 2 219 | 2 219 |
The difference in RGP count comes from a difference in clustering, which affects partitioning.
The following image shows the difference in RGP on the GCF_016904235.1_ASM1690423v1_CDS_2283 genome with proksee.
There is a deletion of RGP66 due to the change in partition of the ECK0501 gene family from shell to persistent.
RGPs 12 and 19 merge into RGP9 due to the change in partition of the ECK1494 gene family from persistent to shell.