gawn
gawn copied to clipboard
Correct mapping but strange gene model
Hello, I ran GAWN using A. thaliana reference transcripts on a genome assembly from the same species. I notice quite a few cases in which transcripts seem to map very well to the assembly, but the resulting gene model is very different from the reference gene model, especially in terms of CDS features. Here's an example. The transcript AT1G01190.2 has the following reference gene model:
1 araport11 mRNA 83045 84946 . - . ID=transcript:AT1G01190.2;Parent=gene:AT1G01190;biotype=protein_coding;transcript_id=AT1G01190.2
1 araport11 exon 83045 83671 . - . Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon2;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exon_id=AT1G01190.2.exon2;rank=2
1 araport11 CDS 83045 83671 . - 0 ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1 araport11 CDS 83884 84879 . - 0 ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1 araport11 exon 83884 84946 . - . Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=AT1G01190.2.exon1;rank=1
1 araport11 five_prime_UTR 84880 84946 . - . Parent=transcript:AT1G01190.2
whereas in GAWN output gff3, it looks like this:
1_RaGOO indexed_genome mRNA 77996 79895 . - . ID=AT1G01190.2.mrna1;Name=AT1G01190.2;Parent=AT1G01190.2.path1;Dir=sensecoverage=100.0;identity=99.7;matches=1685;mismatches=3;indels=2;unknowns=0
1_RaGOO indexed_genome exon 78835 79895 99 - . ID=AT1G01190.2.mrna1.exon1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1 1063 +
1_RaGOO indexed_genome exon 77996 78622 99 - . ID=AT1G01190.2.mrna1.exon2;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1064 1690 +
1_RaGOO indexed_genome CDS 79830 79893 93 - 0 ID=AT1G01190.2.mrna1.cds1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 3 68 +
I extracted the transcript and protein sequences based on the GFF3 output. The transcript is highly similar to the reference transcript, with only 2 gaps present in the annotated assembly (you can see the alignment here). However, the resulting protein is very short with virtually no similarity to the reference protein. This is not surprising since the annotated CDS is so short.
As far as I understand, the GFF3 result arises directly from the GMAP analysis, so I'm not sure this is in fact a GAWN issue. I'm also not particularly clear on how GMAP predicts CDS features based on transcript sequences only. Can you help me understand the reason for the problem or suggest ways to better explore it or get around it? Do you think that the two gaps in the assembly can cause such a difference in the gene model prediction, or is it more likely a more general issue with GMAP? Thanks!