remora
Urgent Help: Cannot Speed up Metric Extraction script
Hello, I have been working on extracting metrics with remora from a very large nanopore dataset, and I am running into issues with the speed of metric extraction. I am essentially attempting to extract metrics for 100 bp windows I have created. Below is the code I am using; it works, but it takes extremely long. Is there any way to achieve this more efficiently?
```python
# Iterate through windows and process each read
for _, window_row in windows.head(1000).iterrows():
    # Extract window information
    chromosome = window_row['chr']
    mapped_start = int(window_row['mapped_start'])
    mapped_end = int(window_row['mapped_end']) + 1
    read_id = window_row['read_id']
    strand = window_row['strand']
    feature = window_row['feature']
    # Check if the POD5 file associated with the read is loaded
    if read_pod5_mapping.get(read_id) in pod5_readers:
        # Retrieve the POD5 reader associated with the BAM read
        reader = pod5_readers[read_pod5_mapping[read_id]]
        # Read the selected read from the POD5 file
        pod5_read = next(reader.reads(selection=[read_id]))
        bam_read = bam_fh.get_first_alignment(read_id)
        io_read = io.Read.from_pod5_and_alignment(pod5_read, bam_read)
        sample_rate = pod5_read.run_info.sample_rate
        # Define the reference region for the read
        ref_reg = io.RefRegion(
            ctg=str(chromosome),
            strand=strand,
            start=mapped_start,
            end=mapped_end,
        )
        # Perform signal mapping refinement on the reference mapping
        io_read.set_refine_signal_mapping(sig_map_refiner, ref_mapping=True)
        try:
            read_metrics = io_read.compute_per_base_metric(
                "dwell", ref_anchored=True, region=ref_reg
            )
            translocation_times = read_metrics["dwell"] / sample_rate
        except Exception:
            # Skip reads for which metrics cannot be computed
            continue
```
There are many considerations here, but I will start with the simplest steps to speed up metric extraction.

The first is that I would recommend using the `io.get_ref_reg_samples_metrics` function, which computes the metrics and returns them as easy-to-use numpy arrays. Note that if you have many POD5 files per sample, the pod5 library now supports directory input to the `pod5.DatasetReader` interface. You can see these snippets in the metrics extraction notebook.
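As a rough outline of that recommendation (the exact argument names and return shape may differ between remora versions, so treat this as a sketch and check the metrics extraction notebook; `bam_fh` and `sig_map_refiner` are assumed to come from your existing setup):

```python
import pod5
from remora import io

# DatasetReader can take a directory containing many POD5 files,
# replacing the per-file reader dictionary in the original script
pod5_dr = pod5.DatasetReader("path/to/pod5_dir")

# One call computes metrics for all reads covering a reference region,
# returning them stacked into numpy arrays (one entry per sample)
samples_metrics = io.get_ref_reg_samples_metrics(
    ref_reg,                  # an io.RefRegion, as in your script
    [(pod5_dr, bam_fh)],      # (POD5 reader, BAM handle) per sample
    metric="dwell",
    sig_map_refiner=sig_map_refiner,
)
```

This replaces the per-window, per-read Python loop with a single call per region.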
The next recommendation would be to increase the window size. Generally a value close to the median read length works well. If you go much larger, the regions will contain too much empty space, and sparse matrices are not currently supported. If you go smaller, you end up processing the same read several times in order to extract the reference-anchored metrics, which is likely what is happening in the script posted here. For a truly fast metric extraction, each read would be processed once and all of its metrics output into an efficient format for the intended downstream purpose.
I think these updates should get you most of the way to a much more efficient metric extraction code snippet.
Note also that the sample rate is likely the same for the whole run, so you could extract the dwell matrix and then divide the whole matrix by the sample rate at the end.
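For example, with a hypothetical dwell matrix (reads × reference positions, values in raw samples), the conversion to seconds becomes a single vectorized division instead of a per-read operation:

```python
import numpy as np

# Hypothetical dwell matrix: 2 reads x 3 reference positions, in samples
dwell = np.array([[10.0, 12.0, 8.0],
                  [9.0, 11.0, 7.0]])
sample_rate = 4000.0  # samples per second, from the run info

# Divide the whole matrix once at the end
translocation_times = dwell / sample_rate
```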
@marcus1487 thank you so much for the prompt reply, I really appreciate it. These fixes are great; however, I am running into an issue where a Remora error (BAM record not found in POD5) stops execution of the entire `get_ref_reg_samples_metrics` call. Is there a way to continue executing `get_ref_reg_samples_metrics` even when an error is encountered? I attempted to use a try/except block, but when the error is raised the `get_ref_reg_samples_metrics` call does not complete.
I've just pushed a branch (`metrics_missing_ok`) which exposes the `missing_ok` argument in the `get_ref_reg_samples_metrics` function. If you re-install remora from this branch on GitHub and supply `missing_ok=True` to `get_ref_reg_samples_metrics`, you should be able to use this method. I'll get this into the next release as well, but this should get you going for now.
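With that branch installed, the call would simply pick up the new parameter (names are mirrored from this thread; the surrounding arguments are placeholders standing in for your existing handles, and the exact signature may differ):

```python
samples_metrics = io.get_ref_reg_samples_metrics(
    ref_reg,
    pod5_and_bam_handles,
    metric="dwell",
    sig_map_refiner=sig_map_refiner,
    missing_ok=True,  # skip BAM records whose read is absent from the POD5
)
```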
Thank you Marcus, that issue seems to be resolved now. My one final question: once I have the numpy arrays, how do I access a specific read by its read_id? For example, if `samples_metrics[0]['dwell'].shape` is `(70, 460068)`, how can I see which read_ids these 70 reads came from? I need to be able to access specific regions on specific reads, but I cannot do this if I do not know which read_id each of the 70 rows corresponds to. In other words, I don't want to index the 70 reads using integers; I want to index using the actual read_id.
You'll have to dig a level deeper into the API to extract the read_ids for each row of the metrics arrays. If you look at the `get_ref_reg_sample_metrics` function in the code here, you can extract the read ids from the `io_reads` objects. I could consider adding `read_id` as another key in the returned metrics dictionary. To avoid breaking other dependencies that use this interface (as it is likely the most powerful API function in the repo), I would probably make this an option, with the default returning the dict in its current form. I will look into adding this feature now.
After extracting the read ids, though, how would you go about adding the value to the dictionary? I am struggling to understand how to change the existing code:
```python
#read_ids = []
io_reads = get_io_reads(bam_reads, pod5_fh, reverse_signal, missing_ok=True)
if sig_map_refiner is not None and not skip_sig_map_refine:
    for io_read in io_reads:
        io_read.set_refine_signal_mapping(sig_map_refiner, ref_mapping=True)
        #read_ids.append(io_read.read_id)
#print(len(read_ids))
#print(read_ids)
sample_metrics = [
    io_read.compute_per_base_metric(metric, region=ref_reg, **kwargs)
    for io_read in io_reads
]
if len(sample_metrics) > 0:
    reg_metrics = dict(
        (
            metric_name,  # added
            np.stack([mv[metric_name] for mv in sample_metrics]),
        )
        for metric_name in sample_metrics[0].keys()
    )
```
How would I go about adding these read_ids so I can index the numpy arrays with them? Or were you talking about creating a different numpy array per read? I am very confused by this, my apologies.
I do not know of any way to directly index numpy array rows with string keys. I am thinking of a dict mapping each read_id to the numpy array row index for that read, something like `read_id_indices = dict((io_read.read_id, rid_idx) for rid_idx, io_read in enumerate(io_reads))`. You could then index into the numpy array with `reg_metrics[metric_name][read_id_indices[read_id]]`. The feature to add to the code would be storing this read_id dict in the `reg_metrics` dict, so that you could do `reg_metrics[metric_name][reg_metrics["read_id"][read_id]]`.
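To make that concrete, here is a self-contained sketch with stand-in objects (the `IoRead` class, read ids, and metrics matrix below are fabricated for illustration; in remora the read ids would come from the real `io_reads` objects):

```python
import numpy as np

# Stand-in for remora's io.Read; only the read_id attribute matters here
class IoRead:
    def __init__(self, read_id):
        self.read_id = read_id

io_reads = [IoRead("read-a"), IoRead("read-b"), IoRead("read-c")]
# Fabricated metrics matrix: 3 reads x 4 reference positions
reg_metrics = {"dwell": np.arange(12.0).reshape(3, 4)}

# Map each read_id to its row index in the metrics arrays
read_id_indices = dict(
    (io_read.read_id, rid_idx) for rid_idx, io_read in enumerate(io_reads)
)
# Storing the mapping alongside the metrics enables name-based lookup
reg_metrics["read_id"] = read_id_indices

# Index a specific read's row by name rather than integer position
row = reg_metrics["dwell"][reg_metrics["read_id"]["read-b"]]
```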