gawn icon indicating copy to clipboard operation
gawn copied to clipboard

Correct mapping but strange gene model

Open soungalo opened this issue 3 years ago • 3 comments

Hello, I ran GAWN using A. thaliana reference transcripts on a genome assembly from the same species. I notice quite a few cases in which transcripts seem to map very well to the assembly, but the resulting gene model is very different from the reference gene model, especially in terms of CDS features. Here's an example. The transcript AT1G01190.2 has the following reference gene model:

1       araport11       mRNA    83045   84946   .       -       .       ID=transcript:AT1G01190.2;Parent=gene:AT1G01190;biotype=protein_coding;transcript_id=AT1G01190.2
1       araport11       exon    83045   83671   .       -       .       Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon2;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exon_id=AT1G01190.2.exon2;rank=2
1       araport11       CDS     83045   83671   .       -       0       ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1       araport11       CDS     83884   84879   .       -       0       ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1       araport11       exon    83884   84946   .       -       .       Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=AT1G01190.2.exon1;rank=1
1       araport11       five_prime_UTR  84880   84946   .       -       .       Parent=transcript:AT1G01190.2

whereas in GAWN output gff3, it looks like this:

1_RaGOO indexed_genome  mRNA    77996   79895   .       -       .       ID=AT1G01190.2.mrna1;Name=AT1G01190.2;Parent=AT1G01190.2.path1;Dir=sensecoverage=100.0;identity=99.7;matches=1685;mismatches=3;indels=2;unknowns=0
1_RaGOO indexed_genome  exon    78835   79895   99      -       .       ID=AT1G01190.2.mrna1.exon1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1 1063 +
1_RaGOO indexed_genome  exon    77996   78622   99      -       .       ID=AT1G01190.2.mrna1.exon2;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1064 1690 +
1_RaGOO indexed_genome  CDS     79830   79893   93      -       0       ID=AT1G01190.2.mrna1.cds1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 3 68 +

I extracted the transcript and protein sequences based on the GFF3 output. The transcript is highly similar to the reference transcript, with only 2 gaps present in the annotated assembly (you can see the alignment here). However, the resulting protein is very short with virtually no similarity to the reference protein. This is not surprising since the annotated CDS is so short.

As far as I understand, the GFF3 result arises directly from the GMAP analysis, so I'm not sure this is in fact a GAWN issue. I'm also not particularly clear on how GMAP predicts CDS features based on transcript sequences only. Can you help me understand the reason for the problem or suggest ways to better explore it or get around it? Do you think that the two gaps in the assembly can cause such a difference in the gene model prediction, or is it more likely a more general issue with GMAP? Thanks!

soungalo avatar Jul 13 '20 14:07 soungalo