xpore/scripts/xpore-dataprep: steps
Goal: Convert the output of nanopolish-eventalign, which is read-wise and in transcriptomic coordinates, to a position-wise format in genomic coordinates.
Current steps:
- For each read, combine the multiple (signal) events that nanopolish-eventalign aligns to the same position into a single event per position.
- The nanopolish-eventalign output looks like
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
ENST00000305885.2 65 GGAGC 527492 t 3 114.39 8.741 0.01328 GGAGC 121.19 5.69 -1.04 100036 100076
ENST00000305885.2 65 GGAGC 527492 t 4 122.54 6.191 0.00266 GGAGC 121.19 5.69 0.21 100028 100036
ENST00000305885.2 65 GGAGC 527492 t 5 112.07 8.895 0.00564 GGAGC 121.19 5.69 -1.39 100011 100028
ENST00000305885.2 65 GGAGC 527492 t 6 118.68 3.960 0.00232 GGAGC 121.19 5.69 -0.38 100004 100011
ENST00000305885.2 65 GGAGC 527492 t 7 120.74 5.917 0.00266 GGAGC 121.19 5.69 -0.07 99996 100004
ENST00000305885.2 65 GGAGC 527492 t 8 126.58 7.295 0.00631 GGAGC 121.19 5.69 0.82 99977 99996
ENST00000305885.2 66 GAGCA 527492 t 9 105.75 7.531 0.01162 GAGCA 107.01 3.02 -0.36 99942 99977
ENST00000305885.2 66 GAGCA 527492 t 10 114.03 2.819 0.00299 GAGCA 107.01 3.02 2.02 99933 99942
ENST00000305885.2 66 GAGCA 527492 t 11 100.41 6.246 0.00199 GAGCA 107.01 3.02 -1.90 99927 99933
- For example, we want to combine the first 6 lines of GGAGC segments.
- This is done by `parallel_combine`, which generates tasks and loads them into a task queue; `combine` then processes them in parallel. Each task covers one read.
- In the `combine` function, the eventalign information of each read is loaded into a pandas dataframe, and segments at the same position are combined with a `groupby` operation.
- The output of this step is `eventalign.log` and `eventalign.hdf5`, which has the format below.
[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']
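The per-read combine step can be sketched with pandas as below. This is a minimal illustration, not the actual xpore code: the column subset is trimmed for brevity, and the event_length-weighted averaging is an assumption about how `norm_mean` is derived from the raw events.

```python
import io
import pandas as pd

# Toy slice of nanopolish-eventalign output (columns trimmed for brevity).
EVENTALIGN_TSV = """\
contig\tposition\treference_kmer\tevent_level_mean\tevent_length
ENST00000305885.2\t65\tGGAGC\t114.39\t0.01328
ENST00000305885.2\t65\tGGAGC\t122.54\t0.00266
ENST00000305885.2\t66\tGAGCA\t105.75\t0.01162
"""

def combine(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse all events at the same position into one row, using an
    event_length-weighted mean of event_level_mean (assumed definition)."""
    df = df.copy()
    df["weighted"] = df["event_level_mean"] * df["event_length"]
    grouped = df.groupby(
        ["contig", "position", "reference_kmer"], as_index=False
    ).agg(total_length=("event_length", "sum"), weighted_sum=("weighted", "sum"))
    grouped["norm_mean"] = (grouped["weighted_sum"] / grouped["total_length"]).round(2)
    return grouped[["contig", "position", "reference_kmer", "norm_mean"]]

events = pd.read_csv(io.StringIO(EVENTALIGN_TSV), sep="\t")
print(combine(events))  # one row per (contig, position, kmer)
```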
Issues
- We normally have 1-3 million reads.
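With millions of reads, the task-queue pattern behind `parallel_combine`/`combine` is what makes this step tractable. A sketch of that pattern is below; it is illustrated with threads for simplicity, the `combine` body is a placeholder, and the real pipeline would use worker processes for CPU-bound work.

```python
import queue
import threading

def combine(task):
    # Placeholder for the real per-read combine step.
    read_id, events = task
    return read_id, len(events)

def parallel_combine(tasks, n_workers=4):
    """Feed one task per read into a queue and let workers drain it."""
    task_q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            task = task_q.get()
            if task is None:          # poison pill: shut this worker down
                break
            read_id, value = combine(task)
            with lock:
                results[read_id] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:                # one task per read
        task_q.put(task)
    for _ in threads:                 # one pill per worker
        task_q.put(None)
    for t in threads:
        t.join()
    return results
```

For CPU-bound work the same structure maps onto `multiprocessing.Process` workers reading from a `multiprocessing.Queue`.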
- Generate read counts from the bamtx file.
- This is done by `count_reads` (not in parallel, because it is fast compared to the other two steps).
- The output of this step is `read_count.csv`, which contains the number of reads per transcript and maps `transcript_id` to `chr`, `gene_id`, and `gene_name`:
% header
chr, gene_id, gene_name, transcript_id, n_reads
- This is necessary because our model (xpore-diffmod) considers only the genomic positions that have more than a certain number of aligned reads, so genes with fewer reads are not passed to Step 3, saving computation time.
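The counting and filtering could be sketched as below. This assumes the bamtx file has already been parsed into `(read_id, transcript_id)` pairs and that a `tx_info` lookup from transcript to `(chr, gene_id, gene_name)` exists; both helpers and the `min_reads` value are illustrative, not xpore's actual API.

```python
import csv
from collections import Counter

def count_reads(alignments, tx_info, out_csv, min_reads=20):
    """Write read_count.csv and return the transcripts that pass the
    read-count threshold (illustrative sketch, not xpore's count_reads)."""
    n_reads = Counter(tx for _read, tx in alignments)
    kept = []
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["chr", "gene_id", "gene_name", "transcript_id", "n_reads"])
        for tx, n in sorted(n_reads.items()):
            chrom, gene_id, gene_name = tx_info[tx]
            writer.writerow([chrom, gene_id, gene_name, tx, n])
            if n >= min_reads:    # only well-covered transcripts go downstream
                kept.append(tx)
    return kept
```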
- Create a .json file where the information of all reads is stored per genomic position, for modelling.
- This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
- The outputs of this step are `data.log`, `data.json`, and `data.index`. The `data.json` file has the format below.
{
<gene_id>: {
<genomic_position>: {
<kmer>:{
"norm_means": array of float (todo: with 2 decimal points),
"read_ids": array of string.
}
}
}
}
- The `data.index` file contains the start and end offsets of each gene's information, for random access.
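The offset-index idea can be sketched as follows. The exact layout assumed here (one JSON object per gene per line, and a CSV index of `gene_id,start,end` byte offsets) is an illustration, not xpore's actual spec.

```python
import json

def write_data(genes, json_path, index_path):
    """Write one JSON object per gene and record its byte offsets."""
    with open(json_path, "wb") as f_json, open(index_path, "w") as f_idx:
        f_idx.write("gene_id,start,end\n")
        for gene_id, payload in genes.items():
            start = f_json.tell()
            f_json.write((json.dumps({gene_id: payload}) + "\n").encode())
            f_idx.write("%s,%d,%d\n" % (gene_id, start, f_json.tell()))

def read_gene(json_path, index_path, gene_id):
    """Load a single gene via seek, without parsing the whole data.json."""
    with open(index_path) as f_idx:
        next(f_idx)                       # skip the header line
        offsets = {}
        for line in f_idx:
            g, s, e = line.rstrip("\n").split(",")
            offsets[g] = (int(s), int(e))
    start, end = offsets[gene_id]
    with open(json_path, "rb") as f_json:
        f_json.seek(start)
        return json.loads(f_json.read(end - start))[gene_id]
```

Binary mode keeps the `tell()` offsets byte-accurate across platforms, which is what makes the seek-based random access reliable.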
Issues
- If we lower the threshold of read counts, we may have ~10,000 genes.
Discussion
- Steps 1 and 3 are the bottlenecks.
- We can change the file formats if that makes processing faster or requires less storage.
- Do these three steps make sense? Is there a better way to achieve the goal?
- Change the start and end of the `data.json` file, because we don't need them! Other file formats should be considered.
Updated: Major changes are the following.
- Only 2 steps left.
- Convert transcriptomic to genomic coordinates on the fly.
- Users can choose to model on genomic or transcriptomic coordinates.
- The format of data.json has been changed.
The goal, Step 1 (`parallel_combine`/`combine`), and its issues are unchanged from the version above.
- Create a .json file where the information of all reads is stored per genomic position, for modelling.
- This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
- The outputs of this step are `data.log`, `data.json`, `data.index`, and `data.readcount`. The `data.json` file has the format below.
{
<gene_id>: {
<genomic_position>: {
<kmer>: array of .2f
}
}
}
- The `data.index` file contains the start and end offsets of each gene's information, for random access.
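Serializing the updated per-kmer "array of .2f" layout might look like the sketch below; the function name is illustrative, and compact separators are one way to keep the file small.

```python
import json

def to_json_record(gene_id, per_position):
    """Round norm_mean values to 2 decimals and serialize compactly."""
    rounded = {
        pos: {kmer: [round(x, 2) for x in values] for kmer, values in kmers.items()}
        for pos, kmers in per_position.items()
    }
    return json.dumps({gene_id: rounded}, separators=(",", ":"))
```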
Issues
- Some genes have >1000 reads.
Discussion
- Step 2 uses a lot of memory if a gene has many reads (this still needs to be measured properly).
- We can change the file formats if that makes processing faster or requires less storage.
- Do these two steps make sense? Is there a better way to achieve the goal?
Updated: Major changes are the following.
- Limit the number of reads.

The goal, Step 1, and Step 2 are unchanged from the version above.
- We keep only `readcount_min` reads (default = 1000) per gene.
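Capping the reads kept per gene could be sketched as below; random subsampling is an assumption, as the notes don't specify which reads are dropped.

```python
import random

def cap_reads(read_ids, readcount_min=1000, seed=0):
    """Keep at most readcount_min reads per gene, subsampling at random."""
    if len(read_ids) <= readcount_min:
        return list(read_ids)
    rng = random.Random(seed)     # fixed seed keeps output reproducible
    return rng.sample(list(read_ids), readcount_min)
```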
Discussion
- We can change the file formats if that makes processing faster or requires less storage.
- Do these steps make sense? Is there a better way to achieve the goal?