remora
Urgent Help: Cannot Speed up Metric Extraction script
Hello, I have been working on extracting metrics with remora from a very large nanopore dataset, and I am running into issues with the speed of metric extraction. I am essentially attempting to extract metrics for 100 bp windows I have created. Below is the code I am using; it works, but it takes extremely long. Is there any way to achieve this more efficiently?
```python
# Iterate through windows and process each read
for _, window_row in windows.head(1000).iterrows():
    # Extract window information
    chromosome = window_row['chr']
    mapped_start = int(window_row['mapped_start'])
    mapped_end = int(window_row['mapped_end']) + 1
    read_id = window_row['read_id']
    strand = window_row['strand']
    feature = window_row['feature']
    # Check if the POD5 file associated with the read is loaded
    if read_pod5_mapping.get(read_id) in pod5_readers:
        # Retrieve the POD5 reader associated with the BAM read
        reader = pod5_readers[read_pod5_mapping[read_id]]
        # Read the selected read from the POD5 file
        pod5_read = next(reader.reads(selection=[read_id]))
        bam_read = bam_fh.get_first_alignment(read_id)
        io_read = io.Read.from_pod5_and_alignment(pod5_read, bam_read)
        sample_rate = pod5_read.run_info.sample_rate
        # Define the reference region for the read
        ref_reg = io.RefRegion(
            ctg=str(chromosome),
            strand=strand,
            start=mapped_start,
            end=mapped_end,
        )
        # Perform signal mapping refinement on the reference mapping
        io_read.set_refine_signal_mapping(sig_map_refiner, ref_mapping=True)
        try:
            read_metrics = io_read.compute_per_base_metric(
                "dwell", ref_anchored=True, region=ref_reg
            )
            translocation_times = read_metrics["dwell"] / sample_rate
        except Exception:
            # Skip reads for which metrics cannot be computed
            continue
```
There are many considerations here, but I will start with the simplest steps to speed up metric extraction.

The first is that I would recommend using the `io.get_ref_reg_samples_metrics` function, which computes the metrics and returns them as easy-to-use numpy arrays. Note that if you have many POD5 files per sample, the pod5 library now supports directory input to the `pod5.DatasetReader` interface. You can see these snippets in the metrics extraction notebook.
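As a rough outline of that recommendation (the exact argument names and return shape may differ between remora versions, so treat this as a sketch and check the metrics extraction notebook; `bam_fh` and `sig_map_refiner` are assumed to come from your existing setup):

```python
import pod5
from remora import io

# DatasetReader can take a directory containing many POD5 files,
# replacing the per-file reader dictionary in the original script
pod5_dr = pod5.DatasetReader("path/to/pod5_dir")

# One call computes metrics for all reads covering a reference region,
# returning them stacked into numpy arrays (one entry per sample)
samples_metrics = io.get_ref_reg_samples_metrics(
    ref_reg,                  # an io.RefRegion, as in your script
    [(pod5_dr, bam_fh)],      # (POD5 reader, BAM handle) per sample
    metric="dwell",
    sig_map_refiner=sig_map_refiner,
)
```

This replaces the per-window, per-read Python loop with a single call per region.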
The next recommendation would be to increase the window size. Generally a value close to the median read length works well. If you go much larger, the regions will contain too much empty space, and sparse matrices are not currently supported. If you go smaller, you end up processing the same read several times in order to extract the reference-anchored metrics, which is likely what is happening in the script posted here. For a truly fast metric extraction, each read would be processed once and all of its metrics output into an efficient format for the intended downstream purpose.
I think these updates should get you most of the way to a much more efficient metric extraction code snippet.
Note also that the sample rate is likely the same for the whole run, so you could extract the dwell matrix and then divide the whole matrix by the sample rate at the end.
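For example, with a hypothetical dwell matrix (reads × reference positions, values in raw samples), the conversion to seconds becomes a single vectorized division instead of a per-read operation:

```python
import numpy as np

# Hypothetical dwell matrix: 2 reads x 3 reference positions, in samples
dwell = np.array([[10.0, 12.0, 8.0],
                  [9.0, 11.0, 7.0]])
sample_rate = 4000.0  # samples per second, from the run info

# Divide the whole matrix once at the end
translocation_times = dwell / sample_rate
```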
@marcus1487 thank you so much for the prompt reply, I really appreciate it. These fixes are great; however, I am running into an issue where a Remora error (BAM record not found in POD5) stops execution of the entire `get_ref_reg_samples_metrics` call. Is there a way to continue executing `get_ref_reg_samples_metrics` even when an error is encountered? I attempted to use a try/except block, but when the error is raised the `get_ref_reg_samples_metrics` call does not complete.
I've just pushed a branch (`metrics_missing_ok`) which exposes the `missing_ok` argument in the `get_ref_reg_samples_metrics` function. If you re-install remora from this branch on GitHub and supply `missing_ok=True` to `get_ref_reg_samples_metrics`, you should be able to use this method. I'll get this into the next release as well, but this should get you going for now.
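With that branch installed, the call would simply pick up the new parameter (names are mirrored from this thread; the surrounding arguments are placeholders standing in for your existing handles, and the exact signature may differ):

```python
samples_metrics = io.get_ref_reg_samples_metrics(
    ref_reg,
    pod5_and_bam_handles,
    metric="dwell",
    sig_map_refiner=sig_map_refiner,
    missing_ok=True,  # skip BAM records whose read is absent from the POD5
)
```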
Thank you Marcus, that issue seems to be resolved now. My one final question: once I have the numpy arrays, how do I access a specific read by its read_id? For example, if `samples_metrics[0]['dwell'].shape` is `(70, 460068)`, how can I see which read_ids these 70 reads came from? I need to be able to access specific regions on specific reads, but I cannot do this if I do not know which read_id each of the 70 rows corresponds to. In other words, I don't want to index the 70 reads using integers; I want to index using the actual read_id.
You'll have to dig a level deeper into the API to extract the read_ids for each row of the metrics arrays. If you look at the `get_ref_reg_sample_metrics` function in the code here, you can extract the read ids from the `io_reads` objects. I could consider adding `read_id` as another key in the returned metrics dictionary. To avoid breaking other dependencies that use this interface (as it is likely the most powerful API function in the repo), I would probably make this an option, with the default returning the dict in its current form. I will look into adding this feature now.
After extracting the read ids, though, how would you go about adding the value to the dictionary? I am struggling to understand how to change the existing code:
```python
#read_ids = []
io_reads = get_io_reads(bam_reads, pod5_fh, reverse_signal, missing_ok=True)
if sig_map_refiner is not None and not skip_sig_map_refine:
    for io_read in io_reads:
        io_read.set_refine_signal_mapping(sig_map_refiner, ref_mapping=True)
        #read_ids.append(io_read.read_id)
#print(len(read_ids))
#print(read_ids)
sample_metrics = [
    io_read.compute_per_base_metric(metric, region=ref_reg, **kwargs)
    for io_read in io_reads
]
if len(sample_metrics) > 0:
    reg_metrics = dict(
        (
            metric_name,  # added
            np.stack([mv[metric_name] for mv in sample_metrics]),
        )
        for metric_name in sample_metrics[0].keys()
    )
```
How would I go about adding these read_ids so I can index the numpy arrays with them? Or were you talking about creating a different numpy array per read? I am very confused by this, my apologies.
I do not know of any way to directly index numpy array rows with string keys. I am thinking of a dict mapping each read_id to the numpy array row index for that read, something like `read_id_indices = dict((io_read.read_id, rid_idx) for rid_idx, io_read in enumerate(io_reads))`. You could then index into the numpy array with `reg_metrics[metric_name][read_id_indices[read_id]]`. The feature to add to the code would be storing this read_id dict in the `reg_metrics` dict, so that you could do `reg_metrics[metric_name][reg_metrics["read_id"][read_id]]`.
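To make that concrete, here is a self-contained sketch with stand-in objects (the `IoRead` class, read ids, and metrics matrix below are fabricated for illustration; in remora the read ids would come from the real `io_reads` objects):

```python
import numpy as np

# Stand-in for remora's io.Read; only the read_id attribute matters here
class IoRead:
    def __init__(self, read_id):
        self.read_id = read_id

io_reads = [IoRead("read-a"), IoRead("read-b"), IoRead("read-c")]
# Fabricated metrics matrix: 3 reads x 4 reference positions
reg_metrics = {"dwell": np.arange(12.0).reshape(3, 4)}

# Map each read_id to its row index in the metrics arrays
read_id_indices = dict(
    (io_read.read_id, rid_idx) for rid_idx, io_read in enumerate(io_reads)
)
# Storing the mapping alongside the metrics enables name-based lookup
reg_metrics["read_id"] = read_id_indices

# Index a specific read's row by name rather than integer position
row = reg_metrics["dwell"][reg_metrics["read_id"]["read-b"]]
```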