How to quickly extract the relevant data of multiple regions of genome

Open weir12 opened this issue 5 years ago • 3 comments

Hi, recently I have often needed to use the Tombo API to extract data (such as the mean electrical signal value) from many regions. It seems that the class tombo.tombo_helper.intervalData supports only one region, so I need to create n objects of that class and call .add_reads(reads_index).get_base_levels() on each one. Meanwhile, tombo.tombo_helper.TomboReads does not seem to work well with Python multiprocessing: child processes compete for access to this shared object, and only one process has access at any one time (like a lock). My current approach is to have every child process read the index file independently, but the result is a huge drain on computing resources. Could you improve the situation, if you don't mind? Or is there a better solution that I just don't know about? Best wishes

weir12 avatar Aug 05 '20 17:08 weir12
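
The per-region loop described in the question can be sketched as follows. This is a minimal illustration, not Tombo code: `base_levels_for_regions` is a hypothetical helper, and the interval class is passed in as a parameter so the sketch runs without Tombo installed. With Tombo, `interval_cls` would be `tombo.tombo_helper.intervalData` and `reads_index` a `tombo.tombo_helper.TomboReads` object; the constructor keyword names (`chrm`, `start`, `end`, `strand`) are an assumption and should be checked against the installed version.

```python
def base_levels_for_regions(regions, reads_index, interval_cls):
    """Return {region: base levels}, building one interval object per region.

    regions      -- iterable of (chrm, start, end, strand) tuples
    reads_index  -- the read index (a TomboReads object when used with Tombo)
    interval_cls -- class exposing add_reads() and get_base_levels();
                    assumed to accept chrm/start/end/strand keywords
    """
    results = {}
    for chrm, start, end, strand in regions:
        # One object per region, as described in the question.
        reg = interval_cls(chrm=chrm, start=start, end=end, strand=strand)
        reg.add_reads(reads_index)
        results[(chrm, start, end, strand)] = reg.get_base_levels()
    return results
```

The index is built once and reused for every region, so the per-region cost is limited to the interval object and the read lookups.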

If you are working with processes (and not threads), then the TomboReads object should not be a shared object, and there should not be any locks on access to this data structure. In fact, this is exactly how the Tombo statistic aggregation multiprocessing works (here in the code). If you could share more details about your implementation, with a minimal reproducible example and benchmarks, I would be happy to take a further look.

What you may be experiencing is a block on the single FAST5 files being accessed simultaneously by multiple processes. This is one of the major flaws with the storage in single FAST5 files within the Tombo framework. This will unfortunately not be solved in the Tombo framework, but we are aiming to address these issues with a new software package at a future date.

marcus1487 avatar Aug 05 '20 17:08 marcus1487

Hi, here is my multi-process approach:

[Image: screenshot of the multiprocessing code]

When this code executes, only one process is ever in the "R" state, with 100% CPU, while the other processes are in the "S" state with 0% CPU. Thank you for your help.

With this method, each child gets a copy of the main process's sample_index instead of a shared memory address. But I still don't understand why the child processes are blocked.

weir12 avatar Aug 06 '20 11:08 weir12

This test looks to be processing only a single read. Thus, my best guess is that each process acquires a lock on the HDF5 file for reading, extracts the data, and then releases the lock. If this were scaled up to many reads, the processes could likely process the data more efficiently using multiple cores.

marcus1487 avatar Aug 11 '20 17:08 marcus1487