SUPPA icon indicating copy to clipboard operation
SUPPA copied to clipboard

generateEvents option returning empty .ioe files

Open ArnavBharti opened this issue 1 year ago • 5 comments

For input:

  1. I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
  2. I converted it to GTF file using agat agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
  3. suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}

Output: empty .ioe files eg.

seqname	gene_id	event_id	alternative_transcripts	total_transcripts

GTF file:

PvP01_API_v2	VEuPathDB	ncRNA_gene	9	63	.	+	.	gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA";
PvP01_API_v2	VEuPathDB	tRNA	9	63	.	+	.	gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA";
PvP01_API_v2	VEuPathDB	exon	9	63	.	+	.	gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1";
PvP01_API_v2	VEuPathDB	protein_coding_gene	90	704	.	+	.	gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding";
PvP01_API_v2	VEuPathDB	mRNA	90	704	.	+	.	gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding";
PvP01_API_v2	VEuPathDB	exon	90	704	.	+	.	gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1";
PvP01_API_v2	VEuPathDB	CDS	90	704	.	+	0	gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1";
PvP01_API_v2	VEuPathDB	ncRNA_gene	716	787	.	+	.	gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";

ArnavBharti avatar Jan 03 '24 03:01 ArnavBharti

Hi Arnav,

could you please send me the GTF here: @.*** ?

In principle the format looks fine. But it was generated from a GFF, which has a single ID in column 9. So I am wondering whether your GTF contains multiple transcript IDs associated to the same Gene ID. Without that, SUPPA cannot calculate the AS events.

Thanks

Eduardo

On Wed, 3 Jan 2024 at 14:26, Arnav Bharti @.***> wrote:

For input:

  1. I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
  2. I converted it to GTF file using agat agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
  3. suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}

Output: empty .ioe files eg.

seqname gene_id event_id alternative_transcripts total_transcripts

GTF file:

PvP01_API_v2 VEuPathDB ncRNA_gene 9 63 . + . gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA"; PvP01_API_v2 VEuPathDB tRNA 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA"; PvP01_API_v2 VEuPathDB exon 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1"; PvP01_API_v2 VEuPathDB protein_coding_gene 90 704 . + . gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding"; PvP01_API_v2 VEuPathDB mRNA 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding"; PvP01_API_v2 VEuPathDB exon 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1"; PvP01_API_v2 VEuPathDB CDS 90 704 . + 0 gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1"; PvP01_API_v2 VEuPathDB ncRNA_gene 716 787 . + . gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";

— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB4MYI243INRJNOFAHLYMTFU7AVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA3DGMJZGI2TMOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

EduEyras avatar Jan 04 '24 04:01 EduEyras

I managed to get the GTF file (instead of converting from GFF).

A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - . gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref "GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type"; orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1"; where the ; in LSU 5.8S rRNA; O-type was split into two causing an IndexError while parsing. I was wondering if replacing ; with ; (removing space) where it occurs inside quotation marks would affect the output. 'Cause by doing this IndexError goes away.

BUT the .ioe file generated is still empty.

genomic_data.gtf.zip

att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes)) <- this line is causing the index error

ArnavBharti avatar Jan 04 '24 16:01 ArnavBharti

Thanks for sending the GTF file.

The issue is what I suspected. The GTF does not contain multiple transcripts per gene.

Using only the "exon" lines (used by SUPPA), I counted how many genes "gene_id" contain more than 1 "transcript_id", but there are none:

awk '$3=="exon"' genomic_data.gtf | cut -f9 | cut -f1-4 -d" " | sort -u | cut -f1,2 -d" " | sort | uniq -c | awk '$1>1' | wc -l

   0

In these cases, you can use the option -p | --pool-genes when running generateEvents This option will first cluster the transcripts into genes according to exon overlaps in the same strand. It uses a simple definition of gene to be transcripts with exons that overlap on the same strand and share at least one splice-site.

This will create new gene IDs based on the gene_id IDs that are overlapping in genomic loci.

Please try this to see whether there are overlapping transcripts in the annotated genomic loci.

I hope this helps

Best

Eduardo

On Fri, 5 Jan 2024 at 03:04, Arnav Bharti @.***> wrote:

I managed to get the GTF file (instead of converting from GFF).

A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - . gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref "GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type"; orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1"; where the ; in LSU 5.8S rRNA; O-type was split into two causing an IndexError while parsing. I was wondering if replacing ; with ; (removing space) where it occurs inside quotation marks would affect the output. 'Cause by doing this IndexError goes away.

BUT the .ioe file generated is empty.

genomic_data.gtf.zip https://github.com/comprna/SUPPA/files/13832434/genomic_data.gtf.zip

att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes)) <- this line is causing the index error

— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178#issuecomment-1877351898, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6VT4PGWOCB4PDTY43YM3HIZAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZXGM2TCOBZHA . You are receiving this because you commented.Message ID: @.***>

EduEyras avatar Jan 04 '24 23:01 EduEyras

Even with pool genes flag the output is still empty.

$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e {SE,SS,MX,RI,FL} --pool-genes

INFO:eventGenerator:Reading input data.
INFO:eventGenerator:Pooling genes
INFO:eventGenerator:Calculating events
INFO:eventGenerator:Done

$ cat local*.ioe > biglocal.ioe $ cat biglocal.ioe

seqname	gene_id	event_id	alternative_transcripts	total_transcripts
seqname	gene_id	event_id	alternative_transcripts	total_transcripts
seqname	gene_id	event_id	alternative_transcripts	total_transcripts
seqname	gene_id	event_id	alternative_transcripts	total_transcripts
seqname	gene_id	event_id	alternative_transcripts	total_transcripts
seqname	gene_id	event_id	alternative_transcripts	total_transcripts
seqname	gene_id	event_id	alternative_transcripts	total_transcripts

ArnavBharti avatar Jan 05 '24 04:01 ArnavBharti

Hi Arnav,

this means that the GTF annotation that you're using does not contain any alternative transcripts in the annotated genes, or that the variation of those transcripts cannot be expressed in terms of splicing events.

I checked, and there are no alternative transcripts; that is, the annotation is 1-gene:1-transcript. SUPPA needs an annotation with overlapping transcripts to calculate the events.

You could either enhance the annotation with novel transcripts using RNA-seq reads with a tool like StringTie (or StringTie2 or FLAIR if you have long reads), or directly use RNA-seq reads with e.g. rMATs or MAJIQ to directly calculate alternative splicing events.

Another option is if Ensembl has other annotations that may be more transcript rich (https://protists.ensembl.org/info/data/ftp/index.html)

I hope this helps

Eduardo

On Fri, 5 Jan 2024 at 15:55, Arnav Bharti @.***> wrote:

Even with pool genes flag the output is still empty.

$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e {SE,SS,MX,RI,FL} --pool-genes

INFO:eventGenerator:Reading input data. INFO:eventGenerator:Pooling genes INFO:eventGenerator:Calculating events INFO:eventGenerator:Done

$ cat local*.ioe > biglocal.ioe $ cat biglocal.ioe

seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts

— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178#issuecomment-1878123857, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6BPP6N7LEW5CVLSKTYM6BULAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZYGEZDGOBVG4 . You are receiving this because you commented.Message ID: @.***>

EduEyras avatar Jan 05 '24 07:01 EduEyras