xpore icon indicating copy to clipboard operation
xpore copied to clipboard

Dataprep issue: empty data files

Open acurry-hyde opened this issue 2 years ago • 4 comments

Hi - I'm having a problem running dataprep where the eventalign.index appears to run successfully but the data files do not eventalign.index output:

transcript_id,read_index,pos_start,pos_end ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,15,172,97378 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,27,97378,302018 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,31,302018,492171 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,22,492171,850364

data.index ouput:

idx,start,end

data.json:

N/A - is empty

data.log contains:

Total 0 genes. --- SUCCESSFULLY FINISHED ---

data.readcount contains:

idx,n_reads

I've run dataprep on your test dataset and it runs perfectly and based on other issues I've seen your comments on I'm gathering that there's possibly something wrong with either the fasta of gtf file I'm using, but I'm not 100% certain. Any help would be greatly appreciated. Thanks in advance

head of the eventalign.txt:

contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	0	ATTAA	15	t	517	84.98	0.766	0.01096	ATTAA	85.30	2.46	-0.11	25541	25574
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	1	TTAAT	15	t	518	99.08	1.737	0.00266	TTAAT	94.22	3.04	1.33	25533	25541
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	2	TAATC	15	t	519	101.34	1.078	0.00598	TAATC	106.55	3.11	-1.39	25515	25533
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	3	AATCC	15	t	520	110.00	3.603	0.00465	AATCC	103.25	5.70	0.98	25501	25515
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	4	ATCCC	15	t	521	71.35	1.063	0.00498	ATCCC	70.32	2.30	0.37	25486	25501
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	4	ATCCC	15	t	522	72.25	1.337	0.00531	ATCCC	70.32	2.30	0.70	25470	25486
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	5	TCCCC	15	t	523	65.68	1.155	0.00365	TCCCC	64.94	2.07	0.30	25459	25470
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|	6	CCCCT	15	t	524	62.27	1.499	0.01461	CCCCT	64.58	2.07	-0.93	25415	25459

head of gencode transcript fa:

>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG

head of gencode annotation gtf:

##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2021-09-02
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

acurry-hyde avatar Mar 18 '22 08:03 acurry-hyde

Hi @acurry-hyde,

xpore dataprep does not support the contig being in the following format:

ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|

You will have to change the eventalign contig column to include only the transcript_id ENST00000457540.1 for xpore dataprep to run through.

Thanks!

Best wishes, Yuk Kei

yuukiiwa avatar Mar 28 '22 05:03 yuukiiwa

Thanks for your response @yuukiiwa.

I modified the eventalign contig column as you suggested to include only the transcript_id and then re-ran dataprep but unfortunately it had the same output as before (i.e. all of the data files are empty except for headers). Is it that I should alter the gencode transcripts.fa file to also contain only the transcript_id (ENST) and then redo the alignment and eventalign steps?

Eventalign.txt with altered contig

contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx
ENST00000457540.1	0	ATTAA	15	t	517	84.98	0.766	0.01096	ATTAA	85.30	2.46	-0.11	25541	25574
ENST00000457540.1	1	TTAAT	15	t	518	99.08	1.737	0.00266	TTAAT	94.22	3.04	1.33	25533	25541
ENST00000457540.1	2	TAATC	15	t	519	101.34	1.078	0.00598	TAATC	106.55	3.11	-1.39	25515	25533
ENST00000457540.1	3	AATCC	15	t	520	110.00	3.603	0.00465	AATCC	103.25	5.70	0.98	25501	25515
ENST00000457540.1	4	ATCCC	15	t	521	71.35	1.063	0.00498	ATCCC	70.32	2.30	0.37	25486	25501
ENST00000457540.1	4	ATCCC	15	t	522	72.25	1.337	0.00531	ATCCC	70.32	2.30	0.70	25470	25486
ENST00000457540.1	5	TCCCC	15	t	523	65.68	1.155	0.00365	TCCCC	64.94	2.07	0.30	25459	25470
ENST00000457540.1	6	CCCCT	15	t	524	62.27	1.499	0.01461	CCCCT	64.58	2.07	-0.93	25415	25459
ENST00000457540.1	6	CCCCT	15	t	525	62.98	1.279	0.00664	CCCCT	64.58	2.07	-0.65	25395	25415

The tail of the job log is below just in case it's at all relevant:

/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:72: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)

Many thanks in advance for your help! Ashton

EDIT: the eventalign.index file appears as follows:

transcript_id,read_index,pos_start,pos_end
ENST00000457540.1,15,172,44098
ENST00000457540.1,27,44098,136628
ENST00000457540.1,31,136628,222663
ENST00000457540.1,22,222663,384497
ENST00000416931.1,12,384497,414504
ENST00000457540.1,18,414504,483561
ENST00000424587.7,1,483561,700948
ENST00000635159.1,8,700948,797833
ENST00000457540.1,21,797833,869968

I've noticed that when clicking to highlight the header 'columns' they appear to highlight as individual values (eg. transcript_id or read_index etc (see first screenshot below), where the delimiter is recognised as separating each header value), however the indexed events appear to not recognise the comma delimiter (see second screenshot below)... could this be contributing to the error?

Screen Shot 2022-04-05 at 1 14 00 pm Screen Shot 2022-04-05 at 1 13 46 pm

acurry-hyde avatar Apr 05 '22 03:04 acurry-hyde

I used my own data to run the xpore dataprep, the command is like below: xpore dataprep --eventalign /work/schroederrna/methylation/hbecpolya/hbecpolya_event.tsv --out_dir /work/schroederrna/methylation/hbecpolya/test_4

it takes more than 20h, I didn't get any output, the xpore still running, but the error report shows

Traceback (most recent call last): File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'read_index'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/dywang/.local/bin/xpore", line 8, in sys.exit(main()) File "/home/dywang/.local/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main options.func(options) File "/home/dywang/.local/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 744, in dataprep parallel_index(eventalign_filepath,chunk_size,out_dir,n_processes,resume) File "/home/dywang/.local/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 60, in parallel_index chunk_complete = chunk[chunk['read_index'] != chunk.iloc[-1]['read_index']] File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in getitem indexer = self.columns.get_loc(key) File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc raise KeyError(key) from err KeyError: 'read_index'

can anyone help me with this, thanks a lot!!!

dywang0323 avatar May 04 '22 04:05 dywang0323

Hi @dywang0323,

Do you mind showing the first 10 lines of your hbecpolya_event.tsv file, please? head hbecpolya_event.tsv

Thanks!

Best wishes, Yuk Kei

yuukiiwa avatar May 12 '22 23:05 yuukiiwa