xpore readGTF split issue

Hi,

Thanks a lot for developing xpore!

I'm working on some non-model organisms, and testing xpore with our own pipeline. The transcripts were assembled with stringtie, and xpore was installed by conda install, the package info is:

# packages in environment at /home/ubuntu/Tools/miniconda3/envs/xpore:
#
# Name                    Version                   Build  Channel
xpore                     2.0                pyh5e36f6f_0    bioconda

and we found stringtie output gtf file was not compatible with xpore. The trackback was like below:

Traceback (most recent call last):
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/bin/xpore", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main
    options.func(options)
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 692, in dataprep
    gtf_dict = readGTF(gtf_path_or_url)
  File "/home/ubuntu/Tools/miniconda3/envs/xpore/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 184, in readGTF
    tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]

By checking the readGTF function, I found this issue may due to a "non-standard" split of gtf attributes (column 9) in readGTF function: https://github.com/GoekeLab/xpore/blob/8722c06314d1fec90dde85347186612144fab6e6/xpore/scripts/dataprep.py#L184-L185

The stringtie gtf has transcript_id as the first attribute, but gene_id as the second, which is not the same as human hg38 gtf file.

example stringtie.gtf:

Gm01	StringTie	transcript	120160	131940	.	+	.	transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; gene_name "MSTRG.3"; xloc "XLOC_000001"; cmp_ref "Glyma.01G000600.2.Wm82.a4.v1"; class_code "="; cmp_ref_gene "Glyma.01G000600.Wm82.a4.v1"; tss_id "TSS1";
Gm01	StringTie	exon	120160	122559	.	+	.	transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "1";
Gm01	StringTie	exon	131469	131940	.	+	.	transcript_id "TCONS_00000001"; gene_id "XLOC_000001"; exon_number "2";

Hope this helps, thanks! obenno

Aug 19 '21 06:08 obenno

Hi obenno,

Thank you for reporting this bug! The readGTF() function was written based on the ensembl gtf files, which have transcript_id after the gene_id. I will update the code accordingly and make it more flexible for other kinds of gtf files. Will update you once the code is updated!

Best wishes, Yuk Kei

Aug 20 '21 05:08 yuukiiwa

Hi @yuukiiwa,

Thanks a lot for your prompt response.

If you don't mind, I could do a PR ^^

Best regards, obenno

Aug 21 '21 06:08 obenno

Hi obenno,

Thank you so much for offering to submit a PR! @ploy-np will update the dev branch. Once the dev branch is updated (we will update you on that to avoid merging conflict), you can then submit a PR to the dev branch. Thank you!!

Best wishes, Yuk Kei

Aug 24 '21 04:08 yuukiiwa

Hi @obenno,

The dev branch is updated! It will be great if you can submit a PR to the dev branch. Thank you!!

Best wishes, Yuk Kei

Aug 24 '21 06:08 yuukiiwa

Hi @yuukiiwa,

Thank you for the updates!

The PR was submitted as https://github.com/GoekeLab/xpore/pull/87. The code was tested on both the demo data and my own data. By the way, it seems that the fasta and gtf in the your demo.tar.gz are not valid. I used GRCh38 gtf and cDNA fasta (release 91) downloaded from ensembl's FTP server.

Best regards, obenno

Aug 24 '21 11:08 obenno

Hi obenno,

Thank you for the quick update! We will review and likely merge in the PR by the end of this week! The zenodo demo.tar.gz link is updated but not the link on the readthedocs yet.

Best wishes, Yuk Kei

Aug 24 '21 12:08 yuukiiwa

Hi @yuukiiwa,

Thank you and please feel free to close this issue.

Aug 27 '21 05:08 obenno

xpore xpore copied to clipboard

readGTF split issue

xpore
xpore copied to clipboard