xpore
xpore copied to clipboard
readGTF split issue
Hi,
Thanks a lot for developing xpore!
I'm working on some non-model organisms, and testing xpore with our own pipeline. The transcripts were assembled with stringtie, and xpore was installed by conda install
, the package info is:
# packages in environment at /home/ubuntu/Tools/miniconda3/envs/xpore:
#
# Name Version Build Channel
xpore 2.0 pyh5e36f6f_0 bioconda
and we found stringtie output gtf file was not compatible with xpore. The trackback was like below:
Traceback (most recent call last):
File "/home/ubuntu/Tools/miniconda3/envs/xpore/bin/xpore", line 10, in <module>
sys.exit(main())
File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main
options.func(options)
File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 692, in dataprep
gtf_dict = readGTF(gtf_path_or_url)
File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 184, in readGTF
tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]
By checking the readGTF
function, I found this issue may due to a "non-standard" split of gtf attributes (column 9) in readGTF function:
https://github.com/GoekeLab/xpore/blob/8722c06314d1fec90dde85347186612144fab6e6/xpore/scripts/dataprep.py#L184-L185
The stringtie gtf has transcript_id
as the first attribute, but gene_id
as the second, which is not the same as human hg38 gtf file.
example stringtie.gtf:
Gm01 StringTie transcript 120160 131940 . + . transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; gene_name "MSTRG.3"; xloc "XLOC_000001"; cmp_ref "Glyma.01G000600.2.Wm82.a4.v1"; class_code "="; cmp_ref_gene "Glyma.01G000600.Wm82.a4.v1"; tss_id "TSS1";
Gm01 StringTie exon 120160 122559 . + . transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "1";
Gm01 StringTie exon 131469 131940 . + . transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "2";
Hope this helps, thanks! obenno
Hi obenno,
Thank you for reporting this bug! The readGTF() function was written based on the ensembl
gtf
files, which have transcript_id
after the gene_id
. I will update the code accordingly and make it more flexible for other kinds of gtf
files. Will update you once the code is updated!
Best wishes, Yuk Kei
Hi @yuukiiwa,
Thanks a lot for your prompt response.
If you don't mind, I could do a PR ^^
Best regards, obenno
Hi obenno,
Thank you so much for offering to submit a PR! @ploy-np will update the dev
branch. Once the dev
branch is updated (we will update you on that to avoid merging conflict), you can then submit a PR to the dev
branch.
Thank you!!
Best wishes, Yuk Kei
Hi @obenno,
The dev
branch is updated! It will be great if you can submit a PR to the dev
branch.
Thank you!!
Best wishes, Yuk Kei
Hi @yuukiiwa,
Thank you for the updates!
The PR was submitted as https://github.com/GoekeLab/xpore/pull/87. The code was tested on both the demo data and my own data. By the way, it seems that the fasta and gtf in the your demo.tar.gz
are not valid. I used GRCh38 gtf and cDNA fasta (release 91) downloaded from ensembl's FTP server.
Best regards, obenno
Hi obenno,
Thank you for the quick update! We will review and likely merge in the PR by the end of this week!
The zenodo demo.tar.gz
link is updated but not the link on the readthedocs yet.
Best wishes, Yuk Kei
Hi @yuukiiwa,
Thank you and please feel free to close this issue.