SUPPA
SUPPA copied to clipboard
generateEvents option returning empty .ioe files
For input:
- I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
- I converted it to GTF file using agat
agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
-
suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}
Output: empty .ioe files eg.
seqname gene_id event_id alternative_transcripts total_transcripts
GTF file:
PvP01_API_v2 VEuPathDB ncRNA_gene 9 63 . + . gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA";
PvP01_API_v2 VEuPathDB tRNA 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA";
PvP01_API_v2 VEuPathDB exon 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1";
PvP01_API_v2 VEuPathDB protein_coding_gene 90 704 . + . gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding";
PvP01_API_v2 VEuPathDB mRNA 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding";
PvP01_API_v2 VEuPathDB exon 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1";
PvP01_API_v2 VEuPathDB CDS 90 704 . + 0 gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1";
PvP01_API_v2 VEuPathDB ncRNA_gene 716 787 . + . gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";
Hi Arnav,
could you please send me the GTF here: @.*** ?
In principle the format looks fine. But it was generated from a GFF, which has a single ID in column 9. So I am wondering whether your GTF contains multiple transcript IDs associated to the same Gene ID. Without that, SUPPA cannot calculate the AS events.
Thanks
Eduardo
On Wed, 3 Jan 2024 at 14:26, Arnav Bharti @.***> wrote:
For input:
- I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
- I converted it to GTF file using agat agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
- suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}
Output: empty .ioe files eg.
seqname gene_id event_id alternative_transcripts total_transcripts
GTF file:
PvP01_API_v2 VEuPathDB ncRNA_gene 9 63 . + . gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA"; PvP01_API_v2 VEuPathDB tRNA 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA"; PvP01_API_v2 VEuPathDB exon 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1"; PvP01_API_v2 VEuPathDB protein_coding_gene 90 704 . + . gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding"; PvP01_API_v2 VEuPathDB mRNA 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding"; PvP01_API_v2 VEuPathDB exon 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1"; PvP01_API_v2 VEuPathDB CDS 90 704 . + 0 gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1"; PvP01_API_v2 VEuPathDB ncRNA_gene 716 787 . + . gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";
— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB4MYI243INRJNOFAHLYMTFU7AVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA3DGMJZGI2TMOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>
I managed to get the GTF file (instead of converting from GFF).
A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - . gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref "GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type"; orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1";
where the ;
in LSU 5.8S rRNA; O-type
was split into two causing an IndexError while parsing. I was wondering if replacing ;
with ;
(removing space) where it occurs inside quotation marks would affect the output. 'Cause by doing this IndexError goes away.
BUT the .ioe file generated is still empty.
att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes))
<- this line is causing the index error
Thanks for sending the GTF file.
The issue is what I suspected. The GTF does not contain multiple transcripts per gene.
Using only the "exon" lines (used by SUPPA), I counted how many genes "gene_id" contain more than 1 "transcript_id", but there are none:
awk '$3=="exon"' genomic_data.gtf | cut -f9 | cut -f1-4 -d" " | sort -u | cut -f1,2 -d" " | sort | uniq -c | awk '$1>1' | wc -l
0
In these cases, you can use the option -p | --pool-genes when running generateEvents This option will first cluster the transcripts into genes according to exon overlaps in the same strand. It uses a simple definition of gene to be transcripts with exons that overlap on the same strand and share at least one splice-site.
This will create new gene IDs based on the gene_id IDs that are overlapping in genomic loci.
Please try this to see whether there are overlapping transcripts in the annotated genomic loci.
I hope this helps
Best
Eduardo
On Fri, 5 Jan 2024 at 03:04, Arnav Bharti @.***> wrote:
I managed to get the GTF file (instead of converting from GFF).
A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - . gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref "GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type"; orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1"; where the ; in LSU 5.8S rRNA; O-type was split into two causing an IndexError while parsing. I was wondering if replacing ; with ; (removing space) where it occurs inside quotation marks would affect the output. 'Cause by doing this IndexError goes away.
BUT the .ioe file generated is empty.
genomic_data.gtf.zip https://github.com/comprna/SUPPA/files/13832434/genomic_data.gtf.zip
att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes)) <- this line is causing the index error
— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178#issuecomment-1877351898, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6VT4PGWOCB4PDTY43YM3HIZAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZXGM2TCOBZHA . You are receiving this because you commented.Message ID: @.***>
Even with pool genes flag the output is still empty.
$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e {SE,SS,MX,RI,FL} --pool-genes
INFO:eventGenerator:Reading input data.
INFO:eventGenerator:Pooling genes
INFO:eventGenerator:Calculating events
INFO:eventGenerator:Done
$ cat local*.ioe > biglocal.ioe
$ cat biglocal.ioe
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
Hi Arnav,
this means that the GTF annotation that you're using does not contain any alternative transcripts in the annotated genes, or that the variation of those transcripts cannot be expressed in terms of splicing events.
I checked, and there are no alternative transcripts; that is, the annotation is 1-gene:1-transcript. SUPPA needs an annotation with overlapping transcripts to calculate the events.
You could either enhance the annotation with novel transcripts using RNA-seq reads with a tool like StringTie (or StringTie2 or FLAIR if you have long reads), or directly use RNA-seq reads with e.g. rMATs or MAJIQ to directly calculate alternative splicing events.
Another option is if Ensembl has other annotations that may be more transcript rich (https://protists.ensembl.org/info/data/ftp/index.html)
I hope this helps
Eduardo
On Fri, 5 Jan 2024 at 15:55, Arnav Bharti @.***> wrote:
Even with pool genes flag the output is still empty.
$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e {SE,SS,MX,RI,FL} --pool-genes
INFO:eventGenerator:Reading input data. INFO:eventGenerator:Pooling genes INFO:eventGenerator:Calculating events INFO:eventGenerator:Done
$ cat local*.ioe > biglocal.ioe $ cat biglocal.ioe
seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts
— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178#issuecomment-1878123857, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6BPP6N7LEW5CVLSKTYM6BULAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZYGEZDGOBVG4 . You are receiving this because you commented.Message ID: @.***>