xpore
xpore copied to clipboard
Dataprep issue: empty data files
Hi - I'm having a problem running dataprep where the eventalign.index appears to run successfully but the data files do not eventalign.index output:
transcript_id,read_index,pos_start,pos_end ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,15,172,97378 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,27,97378,302018 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,31,302018,492171 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,22,492171,850364
data.index ouput:
idx,start,end
data.json:
N/A - is empty
data.log contains:
Total 0 genes. --- SUCCESSFULLY FINISHED ---
data.readcount contains:
idx,n_reads
I've run dataprep on your test dataset and it runs perfectly and based on other issues I've seen your comments on I'm gathering that there's possibly something wrong with either the fasta of gtf file I'm using, but I'm not 100% certain. Any help would be greatly appreciated. Thanks in advance
head of the eventalign.txt:
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 0 ATTAA 15 t 517 84.98 0.766 0.01096 ATTAA 85.30 2.46 -0.11 25541 25574
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 1 TTAAT 15 t 518 99.08 1.737 0.00266 TTAAT 94.22 3.04 1.33 25533 25541
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 2 TAATC 15 t 519 101.34 1.078 0.00598 TAATC 106.55 3.11 -1.39 25515 25533
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 3 AATCC 15 t 520 110.00 3.603 0.00465 AATCC 103.25 5.70 0.98 25501 25515
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 4 ATCCC 15 t 521 71.35 1.063 0.00498 ATCCC 70.32 2.30 0.37 25486 25501
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 4 ATCCC 15 t 522 72.25 1.337 0.00531 ATCCC 70.32 2.30 0.70 25470 25486
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 5 TCCCC 15 t 523 65.68 1.155 0.00365 TCCCC 64.94 2.07 0.30 25459 25470
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 6 CCCCT 15 t 524 62.27 1.499 0.01461 CCCCT 64.58 2.07 -0.93 25415 25459
head of gencode transcript fa:
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG
head of gencode annotation gtf:
##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2021-09-02
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
Hi @acurry-hyde,
xpore dataprep
does not support the contig being in the following format:
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|
You will have to change the eventalign contig column to include only the transcript_id ENST00000457540.1
for xpore dataprep to run through.
Thanks!
Best wishes, Yuk Kei
Thanks for your response @yuukiiwa.
I modified the eventalign contig column as you suggested to include only the transcript_id and then re-ran dataprep but unfortunately it had the same output as before (i.e. all of the data files are empty except for headers). Is it that I should alter the gencode transcripts.fa file to also contain only the transcript_id (ENST) and then redo the alignment and eventalign steps?
Eventalign.txt with altered contig
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
ENST00000457540.1 0 ATTAA 15 t 517 84.98 0.766 0.01096 ATTAA 85.30 2.46 -0.11 25541 25574
ENST00000457540.1 1 TTAAT 15 t 518 99.08 1.737 0.00266 TTAAT 94.22 3.04 1.33 25533 25541
ENST00000457540.1 2 TAATC 15 t 519 101.34 1.078 0.00598 TAATC 106.55 3.11 -1.39 25515 25533
ENST00000457540.1 3 AATCC 15 t 520 110.00 3.603 0.00465 AATCC 103.25 5.70 0.98 25501 25515
ENST00000457540.1 4 ATCCC 15 t 521 71.35 1.063 0.00498 ATCCC 70.32 2.30 0.37 25486 25501
ENST00000457540.1 4 ATCCC 15 t 522 72.25 1.337 0.00531 ATCCC 70.32 2.30 0.70 25470 25486
ENST00000457540.1 5 TCCCC 15 t 523 65.68 1.155 0.00365 TCCCC 64.94 2.07 0.30 25459 25470
ENST00000457540.1 6 CCCCT 15 t 524 62.27 1.499 0.01461 CCCCT 64.58 2.07 -0.93 25415 25459
ENST00000457540.1 6 CCCCT 15 t 525 62.98 1.279 0.00664 CCCCT 64.58 2.07 -0.65 25395 25415
The tail of the job log is below just in case it's at all relevant:
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
Many thanks in advance for your help! Ashton
EDIT: the eventalign.index file appears as follows:
transcript_id,read_index,pos_start,pos_end
ENST00000457540.1,15,172,44098
ENST00000457540.1,27,44098,136628
ENST00000457540.1,31,136628,222663
ENST00000457540.1,22,222663,384497
ENST00000416931.1,12,384497,414504
ENST00000457540.1,18,414504,483561
ENST00000424587.7,1,483561,700948
ENST00000635159.1,8,700948,797833
ENST00000457540.1,21,797833,869968
I've noticed that when clicking to highlight the header 'columns' they appear to highlight as individual values (eg. transcript_id or read_index etc (see first screenshot below), where the delimiter is recognised as separating each header value), however the indexed events appear to not recognise the comma delimiter (see second screenshot below)... could this be contributing to the error?
![Screen Shot 2022-04-05 at 1 14 00 pm](https://user-images.githubusercontent.com/79489840/161671844-c80f04a0-a86a-42f9-ac7a-fafe7c32e0eb.png)
![Screen Shot 2022-04-05 at 1 13 46 pm](https://user-images.githubusercontent.com/79489840/161671805-24071ccb-cf4a-4a15-8cdb-ca8af90785ed.png)
I used my own data to run the xpore dataprep, the command is like below: xpore dataprep --eventalign /work/schroederrna/methylation/hbecpolya/hbecpolya_event.tsv --out_dir /work/schroederrna/methylation/hbecpolya/test_4
it takes more than 20h, I didn't get any output, the xpore still running, but the error report shows
Traceback (most recent call last): File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'read_index'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/dywang/.local/bin/xpore", line 8, in
can anyone help me with this, thanks a lot!!!
Hi @dywang0323,
Do you mind showing the first 10 lines of your hbecpolya_event.tsv
file, please?
head hbecpolya_event.tsv
Thanks!
Best wishes, Yuk Kei