xpore/scripts/xpore-dataprep: steps
Goal: Convert the output of nanopolish-eventalign, which is read-wise and in transcriptomic coordinates, to a position-wise format in genomic coordinates.
Current steps:
- For each read, combine the multiple (signal) events that nanopolish-eventalign aligns to the same position into a single event per position.
- The nanopolish-eventalign output looks like
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
ENST00000305885.2 65 GGAGC 527492 t 3 114.39 8.741 0.01328 GGAGC 121.19 5.69 -1.04 100036 100076
ENST00000305885.2 65 GGAGC 527492 t 4 122.54 6.191 0.00266 GGAGC 121.19 5.69 0.21 100028 100036
ENST00000305885.2 65 GGAGC 527492 t 5 112.07 8.895 0.00564 GGAGC 121.19 5.69 -1.39 100011 100028
ENST00000305885.2 65 GGAGC 527492 t 6 118.68 3.960 0.00232 GGAGC 121.19 5.69 -0.38 100004 100011
ENST00000305885.2 65 GGAGC 527492 t 7 120.74 5.917 0.00266 GGAGC 121.19 5.69 -0.07 99996 100004
ENST00000305885.2 65 GGAGC 527492 t 8 126.58 7.295 0.00631 GGAGC 121.19 5.69 0.82 99977 99996
ENST00000305885.2 66 GAGCA 527492 t 9 105.75 7.531 0.01162 GAGCA 107.01 3.02 -0.36 99942 99977
ENST00000305885.2 66 GAGCA 527492 t 10 114.03 2.819 0.00299 GAGCA 107.01 3.02 2.02 99933 99942
ENST00000305885.2 66 GAGCA 527492 t 11 100.41 6.246 0.00199 GAGCA 107.01 3.02 -1.90 99927 99933
- For example, we want to combine the first 6 lines of GGAGC segments.
- This is done by `parallel_combine`, which generates tasks and loads them into a task queue; `combine` then processes them in parallel. Each task covers one read.
- In the `combine` function, the eventalign information of each read is loaded into a pandas dataframe, and segments at the same position are combined with a `groupby` operation.
- The output of this step is `eventalign.log` and `eventalign.hdf5`, which has the format below.
[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']
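The per-read combine step can be sketched with pandas as below. This is a minimal illustration, not the actual xpore code: the column subset is trimmed for brevity, and the event_length-weighted averaging is an assumption about how `norm_mean` is derived from the raw events.

```python
import io
import pandas as pd

# Toy slice of nanopolish-eventalign output (columns trimmed for brevity).
EVENTALIGN_TSV = """\
contig\tposition\treference_kmer\tevent_level_mean\tevent_length
ENST00000305885.2\t65\tGGAGC\t114.39\t0.01328
ENST00000305885.2\t65\tGGAGC\t122.54\t0.00266
ENST00000305885.2\t66\tGAGCA\t105.75\t0.01162
"""

def combine(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse all events at the same position into one row, using an
    event_length-weighted mean of event_level_mean (assumed definition)."""
    df = df.copy()
    df["weighted"] = df["event_level_mean"] * df["event_length"]
    grouped = df.groupby(
        ["contig", "position", "reference_kmer"], as_index=False
    ).agg(total_length=("event_length", "sum"), weighted_sum=("weighted", "sum"))
    grouped["norm_mean"] = (grouped["weighted_sum"] / grouped["total_length"]).round(2)
    return grouped[["contig", "position", "reference_kmer", "norm_mean"]]

events = pd.read_csv(io.StringIO(EVENTALIGN_TSV), sep="\t")
print(combine(events))  # one row per (contig, position, kmer)
```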
Issues
- We normally have 1-3 million reads.
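With millions of reads, the task-queue pattern behind `parallel_combine`/`combine` is what makes this step tractable. A sketch of that pattern is below; it is illustrated with threads for simplicity, the `combine` body is a placeholder, and the real pipeline would use worker processes for CPU-bound work.

```python
import queue
import threading

def combine(task):
    # Placeholder for the real per-read combine step.
    read_id, events = task
    return read_id, len(events)

def parallel_combine(tasks, n_workers=4):
    """Feed one task per read into a queue and let workers drain it."""
    task_q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            task = task_q.get()
            if task is None:          # poison pill: shut this worker down
                break
            read_id, value = combine(task)
            with lock:
                results[read_id] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:                # one task per read
        task_q.put(task)
    for _ in threads:                 # one pill per worker
        task_q.put(None)
    for t in threads:
        t.join()
    return results
```

For CPU-bound work the same structure maps onto `multiprocessing.Process` workers reading from a `multiprocessing.Queue`.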
- Generate read counts from the bamtx file.
- This is done by `count_reads` (not in parallel, because it is fast compared to the other two steps).
- The output of this step is `read_count.csv`, which contains the number of reads per transcript and maps `transcript_id` to `chr`, `gene_id`, and `gene_name`:
% header
chr, gene_id, gene_name, transcript_id, n_reads
- This is necessary because our model (xpore-diffmod) considers only the genomic positions that have more than a certain number of aligned reads, so genes with fewer reads are not passed to Step 3, saving computation time.
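The counting and filtering could be sketched as below. This assumes the bamtx file has already been parsed into `(read_id, transcript_id)` pairs and that a `tx_info` lookup from transcript to `(chr, gene_id, gene_name)` exists; both helpers and the `min_reads` value are illustrative, not xpore's actual API.

```python
import csv
from collections import Counter

def count_reads(alignments, tx_info, out_csv, min_reads=20):
    """Write read_count.csv and return the transcripts that pass the
    read-count threshold (illustrative sketch, not xpore's count_reads)."""
    n_reads = Counter(tx for _read, tx in alignments)
    kept = []
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["chr", "gene_id", "gene_name", "transcript_id", "n_reads"])
        for tx, n in sorted(n_reads.items()):
            chrom, gene_id, gene_name = tx_info[tx]
            writer.writerow([chrom, gene_id, gene_name, tx, n])
            if n >= min_reads:    # only well-covered transcripts go downstream
                kept.append(tx)
    return kept
```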
- Create a .json file where the information of all reads is stored per genomic position, for modelling.
- This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
- The outputs of this step are `data.log`, `data.json`, and `data.index`. The `data.json` file has the format below.
{
<gene_id>: {
<genomic_position>: {
<kmer>:{
"norm_means": array of float (todo: with 2 decimal points),
"read_ids": array of string.
}
}
}
}
- The `data.index` file contains the start and end offsets of each gene's information, for random access.
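The offset-index idea can be sketched as follows. The exact layout assumed here (one JSON object per gene per line, and a CSV index of `gene_id,start,end` byte offsets) is an illustration, not xpore's actual spec.

```python
import json

def write_data(genes, json_path, index_path):
    """Write one JSON object per gene and record its byte offsets."""
    with open(json_path, "wb") as f_json, open(index_path, "w") as f_idx:
        f_idx.write("gene_id,start,end\n")
        for gene_id, payload in genes.items():
            start = f_json.tell()
            f_json.write((json.dumps({gene_id: payload}) + "\n").encode())
            f_idx.write("%s,%d,%d\n" % (gene_id, start, f_json.tell()))

def read_gene(json_path, index_path, gene_id):
    """Load a single gene via seek, without parsing the whole data.json."""
    with open(index_path) as f_idx:
        next(f_idx)                       # skip the header line
        offsets = {}
        for line in f_idx:
            g, s, e = line.rstrip("\n").split(",")
            offsets[g] = (int(s), int(e))
    start, end = offsets[gene_id]
    with open(json_path, "rb") as f_json:
        f_json.seek(start)
        return json.loads(f_json.read(end - start))[gene_id]
```

Binary mode keeps the `tell()` offsets byte-accurate across platforms, which is what makes the seek-based random access reliable.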
Issues
- If we lower the threshold of read counts, we may have ~10,000 genes.
Discussion
- Steps 1 and 3 are the bottlenecks.
- We can change the file formats if that makes processing faster or requires less storage.
- Do these three steps make sense? Is there a better way to achieve the goal?
- Change the start and end of the `data.json` file, because we don't need them! Other file formats should be considered.
Updated: Major changes are the following.
- Only 2 steps left.
- Convert transcriptomic to genomic coordinates on the fly.
- Users can choose to model on genomic or transcriptomic coordinates.
- The format of data.json has been changed.
The goal, Step 1 (`parallel_combine`/`combine`), and its issues are unchanged from the version above.
- Create a .json file where the information of all reads is stored per genomic position, for modelling.
- This is done by `parallel_preprocess`, which generates tasks and loads them into a task queue; `preprocess` then processes them in parallel.
- The outputs of this step are `data.log`, `data.json`, `data.index`, and `data.readcount`. The `data.json` file has the format below.
{
<gene_id>: {
<genomic_position>: {
<kmer>: array of .2f
}
}
}
- The `data.index` file contains the start and end offsets of each gene's information, for random access.
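Serializing the updated per-kmer "array of .2f" layout might look like the sketch below; the function name is illustrative, and compact separators are one way to keep the file small.

```python
import json

def to_json_record(gene_id, per_position):
    """Round norm_mean values to 2 decimals and serialize compactly."""
    rounded = {
        pos: {kmer: [round(x, 2) for x in values] for kmer, values in kmers.items()}
        for pos, kmers in per_position.items()
    }
    return json.dumps({gene_id: rounded}, separators=(",", ":"))
```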
Issues
- Some genes have >1000 reads.
Discussion
- Step 2 uses a lot of memory if a gene has many reads (this still needs to be measured properly).
- We can change the file formats if that makes processing faster or requires less storage.
- Do these two steps make sense? Is there a better way to achieve the goal?
Updated: Major changes are the following.
- Limit the number of reads.

The goal, Step 1, and Step 2 are unchanged from the version above.
- We keep only `readcount_min` reads (default = 1000) per gene.
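Capping the reads kept per gene could be sketched as below; random subsampling is an assumption, as the notes don't specify which reads are dropped.

```python
import random

def cap_reads(read_ids, readcount_min=1000, seed=0):
    """Keep at most readcount_min reads per gene, subsampling at random."""
    if len(read_ids) <= readcount_min:
        return list(read_ids)
    rng = random.Random(seed)     # fixed seed keeps output reproducible
    return rng.sample(list(read_ids), readcount_min)
```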
Discussion
- We can change the file formats if that makes processing faster or requires less storage.
- Do these steps make sense? Is there a better way to achieve the goal?