PPanGGOLiN icon indicating copy to clipboard operation
PPanGGOLiN copied to clipboard

Manage annotation with joined coordinates

Open JeanMainguy opened this issue 10 months ago • 1 comments

PPanGGOLiN encountered issues when handling joined coordinates present in input annotation files (GFF or GBFF). Such annotations were disregarded when encountered in GBFF files and improperly managed in GFF files.

This PR solves this issue by managing properly these annotations.

Implemented solution

  • [x] Joined annotation cases have been handled when parsing the input files to retrieve different start and stop coordinates.
  • [x] The information of the different coordinates is stored in an attribute coordinates of gene objects. This attribute is a list of start and stop tuples.
  • [x] To keep the information of the joined coordinates, a new table has been added in the hdf5 call joinedCoordinates and stores the multiple start and stop of joined genes.
  • [x] Some pytest functions have been added to check the correctness of the implementation.
  • [x] Ensure proper handling of circular RGPs, addressing issues observed in the spot plot (refer to issue #124) and ProkSee output.

JeanMainguy avatar Mar 28 '24 17:03 JeanMainguy

Examples of joined annotations

Joined coordinates in annotations can originate from two main sources: frameshift mutations and contig edge overlaps.

The gene is annotated with a frameshift.

Example in GBFF file:

     gene            complement(4142570..4143665)
                     /gene="prfB"
                     /locus_tag="PA3701"
                     /db_xref="GeneID:880384"
     CDS             complement(join(4142570..4143593,4143595..4143665))
                     /gene="prfB"
                     /locus_tag="PA3701"
                     /ribosomal_slippage
                     /codon_start=1
                     /transl_table=11
                     /product="peptide chain release factor 1"
                     /protein_id="YP_003933612.1"
                     /db_xref="GeneID:880384"

Example in GFF file:

Sequence Source Type Start End _ Strand Phase Attributes
NC_002516.2 RefSeq gene 4142570 4143665 . - . ID=gene-PA3701
NC_002516.2 RefSeq CDS 4143595 4143665 . - 0 ID=cds-YP_003933612.1
NC_002516.2 RefSeq CDS 4142570 4143593 . - 1 ID=cds-YP_003933612.1

When a gene is annotated with a frameshift, it is represented by two CDS lines in the GFF file, each corresponding to a different piece of the gene.

The gene is overlapping the edge of a circular contig.

Nothing biological behind that just a bioinformatic issue...

Example in GBFF file:

     gene            join(1041974..1042567,1..1176)
                     /locus_tag="BKB96_RS00005"
                     /old_locus_tag="BKB96_00005"
     CDS             join(1041974..1042567,1..1176)
                     /locus_tag="BKB96_RS00005"
                     /old_locus_tag="BKB96_00005"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:NP_219502.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="WP_100115054.1"

Another case where only one base of the gene is the last base of the circular contig

     gene            join(1038313,1..1016)
                     /locus_tag="CTL2C_RS00005"
     CDS             join(1038313,1..1016)
                     /locus_tag="CTL2C_RS00005"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:YP_001654092.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="porphobilinogen synthase"
                     /protein_id="WP_009873181.1"

Example in GFF file:

Sequence Source Type Start End _ Strand Phase Attributes
NZ_CP017736.1 RefSeq gene 1041974 1043743 . + . ID=gene-BKB96_RS00005
NZ_CP017736.1 Protein Homology CDS 1041974 1043743 . + 0 ID=cds-WP_100115054.1

In a GFF file, when a gene extends beyond the boundary of a contig, it means that its end position exceeds the size of the contig. For instance, in the provided example, the contig named 'NZ_CP017736.1' has a size of 1042567 base pairs, but the gene's end position is recorded as 1043743, which surpasses the contig's length. This indicates that the gene extends beyond the edge of the contig, effectively overlapping it.

JeanMainguy avatar Apr 02 '24 16:04 JeanMainguy

Test on few pangenomes:

species D.pigrum P.aeruginosa E.Coli
Before After Before After Before After
Genes 59999 59999 4 933 371 4 939 509 14 932 814 15 017 229
Genomes 32 32 802 802 3190 3190
Families 4123 4123 32 196 32 279 45 002 45 077
Edges 5742 5742 59 301 60 524 131 142 139 257
Persistent 1436 1436 5 181 5 028 3 137 3 135
Shell 477 477 5 982 6 286 7 250 7 234
Cloud 2210 2210 21 033 20 965 34 676 34 708
RGP 806 806 33 894 34 148 236 867 236 824
Spots 97 97 915 829 2 037 1 721
Modules 122 122 1 420 1422 2 219 2 219

The difference in RGP count comes from a difference in clustering, which affects partitioning.

jpjarnoux avatar May 27 '24 18:05 jpjarnoux

The following image shows the difference in RGP on the GCF_016904235.1_ASM1690423v1_CDS_2283 genome with proksee. cmp_proksee_RGP There is a deletion of RGP66 due to the change in partition of the ECK0501 gene family from shell to persistent. RGP66 RGPs 12 and 19 merge into RGP9 due to the change in partition of the ECK1494 gene family from persistent to shell. RGP9

jpjarnoux avatar Jun 03 '24 09:06 jpjarnoux