
xpore/scripts/xpore-dataprep: steps

ploy-np opened this issue 5 years ago • 3 comments

Goal: Convert the output from nanopolish-eventalign, which is in a read-wise format in transcriptomic coordinates, to a position-wise format in genomic coordinates.

Current steps:

  1. For each read, combine the multiple (signal) events that nanopolish-eventalign aligns to the same position into a single event per position.
  • The nanopolish-eventalign output looks like
contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx 
ENST00000305885.2	65	GGAGC	527492	t	3	114.39	8.741	0.01328	GGAGC	121.19	5.69	-1.04	100036	100076
ENST00000305885.2	65	GGAGC	527492	t	4	122.54	6.191	0.00266	GGAGC	121.19	5.69	0.21	100028	100036
ENST00000305885.2	65	GGAGC	527492	t	5	112.07	8.895	0.00564	GGAGC	121.19	5.69	-1.39	100011	100028
ENST00000305885.2	65	GGAGC	527492	t	6	118.68	3.960	0.00232	GGAGC	121.19	5.69	-0.38	100004	100011
ENST00000305885.2	65	GGAGC	527492	t	7	120.74	5.917	0.00266	GGAGC	121.19	5.69	-0.07	99996	100004
ENST00000305885.2	65	GGAGC	527492	t	8	126.58	7.295	0.00631	GGAGC	121.19	5.69	0.82	99977	99996
ENST00000305885.2	66	GAGCA	527492	t	9	105.75	7.531	0.01162	GAGCA	107.01	3.02	-0.36	99942	99977
ENST00000305885.2	66	GAGCA	527492	t	10	114.03	2.819	0.00299	GAGCA	107.01	3.02	2.02	99933	99942
ENST00000305885.2	66	GAGCA	527492	t	11	100.41	6.246	0.00199	GAGCA	107.01	3.02	-1.90	99927	99933
  • For example, we want to combine the first 6 lines, the GGAGC segments.
  • This is done by `parallel_combine`, which generates tasks and loads them into a task queue; `combine` then processes them in parallel. Each task covers one read.
  • In the `combine` function, the eventalign information of each read is loaded into a pandas dataframe, and the segments at the same position are combined with a groupby operation.
  • The output of this step is eventalign.log and eventalign.hdf5, which has the format below.
[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']
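The combine step can be sketched as follows. This is a minimal illustration, not xpore's actual implementation: the per-position mean here is weighted by dwell time (event_length), which is one plausible way to collapse events; xpore's exact weighting, and the subsequent write to HDF5, may differ.

```python
import io

import pandas as pd

# Three of the eventalign rows from the example above (tab-separated,
# trimmed to the columns this sketch needs).
tsv = (
    "contig\tposition\treference_kmer\tevent_level_mean\tevent_length\n"
    "ENST00000305885.2\t65\tGGAGC\t114.39\t0.01328\n"
    "ENST00000305885.2\t65\tGGAGC\t122.54\t0.00266\n"
    "ENST00000305885.2\t66\tGAGCA\t105.75\t0.01162\n"
)
events = pd.read_csv(io.StringIO(tsv), sep="\t")

def combine(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse multiple events per transcriptomic position into one,
    weighting each event's mean by its dwell time (event_length)."""
    events = events.assign(_weighted=events["event_level_mean"] * events["event_length"])
    grouped = events.groupby(["contig", "position", "reference_kmer"])
    norm_mean = grouped["_weighted"].sum() / grouped["event_length"].sum()
    return norm_mean.rename("norm_mean").reset_index()

combined = combine(events)
print(combined)  # one row per (contig, position): here positions 65 and 66
```

In the real pipeline the resulting per-read table would be converted to a numpy structured array and written under `[transcript_id][read_id]['events']` in eventalign.hdf5.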

Issues

  • We normally have 1–3 million reads.
  2. Generate the read count from the bamtx file.
  • This is done by `count_reads` (not parallelized, because it is fast enough compared to the other two steps).
  • The output of this step is read_count.csv, which contains the number of reads per transcript and maps each transcript_id to chr, gene_id, and gene_name:
% header
chr, gene_id, gene_name, transcript_id, n_reads
  • This is necessary because our model (xpore-diffmod) will consider only the genomic positions that have more than a certain number of aligned reads, so genes with fewer reads are not passed to Step 3, saving computation time.
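The read-count step can be sketched like this. The alignment pairs and the annotation mapping below are invented stand-ins for what would actually be parsed from the bamtx file and the annotation (e.g. via pysam); only the counting-and-writing shape is the point.

```python
import csv
import io
from collections import Counter

# Hypothetical (read_id, transcript_id) pairs, standing in for alignments
# parsed from the bamtx file.
alignments = [
    ("read1", "ENST00000305885.2"),
    ("read2", "ENST00000305885.2"),
    ("read3", "ENST00000400000.1"),
]

# Hypothetical transcript_id -> (chr, gene_id, gene_name) annotation.
annotation = {
    "ENST00000305885.2": ("chr10", "ENSG_A", "GENE_A"),
    "ENST00000400000.1": ("chr1", "ENSG_B", "GENE_B"),
}

# Count reads per transcript and emit rows in the read_count.csv layout.
n_reads = Counter(tx for _, tx in alignments)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["chr", "gene_id", "gene_name", "transcript_id", "n_reads"])
for tx, n in sorted(n_reads.items()):
    chrom, gene_id, gene_name = annotation[tx]
    writer.writerow([chrom, gene_id, gene_name, tx, n])

print(buf.getvalue())
```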
  3. Create a .json file, where the information of all reads is stored per genomic position, for modelling.
  • This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
  • The output of this step is data.log, data.json, and data.index.
  • The `data.json` file has the format below.
{
	<gene_id>: {
		<genomic_position>: {
			<kmer>:{
				"norm_means": array of float (todo: with 2 decimal points),
				"read_ids": array of string.
			}
		}
	}
}
  • The `data.index` file contains the start and end byte offsets of each gene's information, for random access.
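The index-for-random-access idea can be sketched like this, with one serialized JSON record per gene and a (start, end) byte range recorded for each. The data is a toy example; xpore's actual on-disk layout may differ.

```python
import json

# Toy per-gene records in the data.json layout sketched above.
genes = {
    "ENSG_A": {"1000": {"GGAGC": {"norm_means": [115.75, 118.20],
                                  "read_ids": ["read1", "read2"]}}},
    "ENSG_B": {"2000": {"GAGCA": {"norm_means": [105.75],
                                  "read_ids": ["read3"]}}},
}

# Write one JSON record per gene, recording byte offsets as we go;
# `index` plays the role of data.index.
data = bytearray()
index = {}  # gene_id -> (start_byte, end_byte)
for gene_id, positions in genes.items():
    start = len(data)
    data += (json.dumps({gene_id: positions}) + "\n").encode()
    index[gene_id] = (start, len(data))

# Random access: decode a single gene without parsing the whole file
# (on disk this would be a seek() to start followed by a read of end - start).
start, end = index["ENSG_B"]
record = json.loads(bytes(data[start:end]))
print(record["ENSG_B"]["2000"]["GAGCA"]["read_ids"])
```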

Issues

  • If we lower the threshold of read counts, we may have ~10,000 genes.

Discussion

  • Steps 1 and 3 are the bottlenecks.
  • We can change the file formats if doing so speeds up processing and requires less storage.
  • Do these three steps make sense? Is there a better way to achieve the goal?

ploy-np avatar Feb 07 '20 03:02 ploy-np

Change the start and end of the data.json file because we don't need them! Other file formats should be considered.

ploy-np avatar Mar 17 '20 05:03 ploy-np

Updated: Major changes are the following.

  • Only 2 steps left.
  • Convert transcriptomic to genomic coordinates on the fly.
  • Users can choose to model on genomic or transcriptomic coordinates.
  • The format of data.json has been changed.

Goal: Convert the output from nanopolish-eventalign, which is in a read-wise format in transcriptomic coordinates, to a position-wise format in genomic coordinates.

Current steps:

  1. For each read, combine the multiple (signal) events that nanopolish-eventalign aligns to the same position into a single event per position.
  • The nanopolish-eventalign output looks like
contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx 
ENST00000305885.2	65	GGAGC	527492	t	3	114.39	8.741	0.01328	GGAGC	121.19	5.69	-1.04	100036	100076
ENST00000305885.2	65	GGAGC	527492	t	4	122.54	6.191	0.00266	GGAGC	121.19	5.69	0.21	100028	100036
ENST00000305885.2	65	GGAGC	527492	t	5	112.07	8.895	0.00564	GGAGC	121.19	5.69	-1.39	100011	100028
ENST00000305885.2	65	GGAGC	527492	t	6	118.68	3.960	0.00232	GGAGC	121.19	5.69	-0.38	100004	100011
ENST00000305885.2	65	GGAGC	527492	t	7	120.74	5.917	0.00266	GGAGC	121.19	5.69	-0.07	99996	100004
ENST00000305885.2	65	GGAGC	527492	t	8	126.58	7.295	0.00631	GGAGC	121.19	5.69	0.82	99977	99996
ENST00000305885.2	66	GAGCA	527492	t	9	105.75	7.531	0.01162	GAGCA	107.01	3.02	-0.36	99942	99977
ENST00000305885.2	66	GAGCA	527492	t	10	114.03	2.819	0.00299	GAGCA	107.01	3.02	2.02	99933	99942
ENST00000305885.2	66	GAGCA	527492	t	11	100.41	6.246	0.00199	GAGCA	107.01	3.02	-1.90	99927	99933
  • For example, we want to combine the first 6 lines, the GGAGC segments.
  • This is done by `parallel_combine`, which generates tasks and loads them into a task queue; `combine` then processes them in parallel. Each task covers one read.
  • In the `combine` function, the eventalign information of each read is loaded into a pandas dataframe, and the segments at the same position are combined with a groupby operation.
  • The output of this step is eventalign.log and eventalign.hdf5, which has the format below.
[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

Issues

  • We normally have 1–3 million reads.
  2. Create a .json file, where the information of all reads is stored per genomic position, for modelling.
  • This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
  • The output of this step is data.log, data.json, data.index, and data.readcount.
  • The `data.json` file has the format below.
{
	<gene_id>: {
		<genomic_position>: {
			<kmer>: array of .2f
		}
	}
}
  • The `data.index` file contains the start and end byte offsets of each gene's information, for random access.
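The leaner record in this updated format drops the read_ids and maps each kmer directly to an array of means rounded to two decimal places, which shrinks the file considerably. A sketch of one such record (values invented):

```python
import json

# Per-read norm_means collected for one kmer at one genomic position.
norm_means = [115.74912, 105.7532, 118.0]

# One gene record in the updated layout: <kmer> maps directly to an
# array of 2-decimal floats; no read_ids are stored.
record = {"ENSG_A": {"1000": {"GGAGC": [round(m, 2) for m in norm_means]}}}
line = json.dumps(record)
print(line)
```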

Issues

  • Some genes have >1000 reads.

Discussion

  • Step 2 uses a lot of memory if a gene has many reads; this still needs to be measured properly.
  • We can change the file formats if doing so speeds up processing and requires less storage.
  • Do these two steps make sense? Is there a better way to achieve the goal?

ploy-np avatar May 25 '20 05:05 ploy-np

Updated: Major changes are the following.

  • Limit the number of reads.

Goal: Convert the output from nanopolish-eventalign, which is in a read-wise format in transcriptomic coordinates, to a position-wise format in genomic coordinates.

Current steps:

  1. For each read, combine the multiple (signal) events that nanopolish-eventalign aligns to the same position into a single event per position.
  • The nanopolish-eventalign output looks like
contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx 
ENST00000305885.2	65	GGAGC	527492	t	3	114.39	8.741	0.01328	GGAGC	121.19	5.69	-1.04	100036	100076
ENST00000305885.2	65	GGAGC	527492	t	4	122.54	6.191	0.00266	GGAGC	121.19	5.69	0.21	100028	100036
ENST00000305885.2	65	GGAGC	527492	t	5	112.07	8.895	0.00564	GGAGC	121.19	5.69	-1.39	100011	100028
ENST00000305885.2	65	GGAGC	527492	t	6	118.68	3.960	0.00232	GGAGC	121.19	5.69	-0.38	100004	100011
ENST00000305885.2	65	GGAGC	527492	t	7	120.74	5.917	0.00266	GGAGC	121.19	5.69	-0.07	99996	100004
ENST00000305885.2	65	GGAGC	527492	t	8	126.58	7.295	0.00631	GGAGC	121.19	5.69	0.82	99977	99996
ENST00000305885.2	66	GAGCA	527492	t	9	105.75	7.531	0.01162	GAGCA	107.01	3.02	-0.36	99942	99977
ENST00000305885.2	66	GAGCA	527492	t	10	114.03	2.819	0.00299	GAGCA	107.01	3.02	2.02	99933	99942
ENST00000305885.2	66	GAGCA	527492	t	11	100.41	6.246	0.00199	GAGCA	107.01	3.02	-1.90	99927	99933
  • For example, we want to combine the first 6 lines, the GGAGC segments.
  • This is done by `parallel_combine`, which generates tasks and loads them into a task queue; `combine` then processes them in parallel. Each task covers one read.
  • In the `combine` function, the eventalign information of each read is loaded into a pandas dataframe, and the segments at the same position are combined with a groupby operation.
  • The output of this step is eventalign.log and eventalign.hdf5, which has the format below.
[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

Issues

  • We normally have 1–3 million reads.
  2. Create a .json file, where the information of all reads is stored per genomic position, for modelling.
  • This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
  • The output of this step is data.log, data.json, data.index, and data.readcount.
  • The `data.json` file has the format below.
{
	<gene_id>: {
		<genomic_position>: {
			<kmer>: array of .2f
		}
	}
}
  • The `data.index` file contains the start and end byte offsets of each gene's information, for random access.
  • We keep at most `readcount_min` reads (default = 1000) per gene.
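The per-gene cap might look like the sketch below. The parameter name follows the text above; whether xpore truncates the read list or subsamples it is not stated, so this sketch assumes deterministic random subsampling.

```python
import random

def limit_reads(read_ids, readcount_min=1000, seed=0):
    """Keep at most `readcount_min` reads per gene, subsampling
    deterministically when a gene has more (assumed behaviour)."""
    read_ids = list(read_ids)
    if len(read_ids) <= readcount_min:
        return read_ids
    return random.Random(seed).sample(read_ids, readcount_min)

# A gene with 1500 reads is capped at 1000; smaller genes are untouched.
kept = limit_reads([f"read{i}" for i in range(1500)])
print(len(kept))
```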

Discussion

  • We can change the file formats if doing so speeds up processing and requires less storage.
  • Do these two steps make sense? Is there a better way to achieve the goal?

ploy-np avatar Jun 11 '20 05:06 ploy-np