
Tombo access to tarballs to offset file-count and storage limits

[Open] JohnUrban opened this issue 6 years ago • 6 comments

Hi Marcus,

I didn't look through all the code to be sure -- but it seems that accessing tarballs is not an option for Tombo (based on a skim of the docs and a couple of trials).

The problem is that MinION runs now generate hundreds of thousands, and even millions, of files, and the storage needs keep increasing. I deal with both file-count and storage quotas. As an example, one of my accounts allows 1 million files and another allows 4 million; each has a storage limit of 5-10 TB -- although of course not all of that is free.

To deal with this, I keep all the fast5s from a given experiment in a tar.gz file to reduce both the number of files and the storage footprint. All of my scripts for working with fast5s handle this by extracting a single fast5 (though it could be a batch) at a time, doing their work, then deleting it and moving on to the next one. Since Tombo actually modifies the fast5 file, the solution would be a little trickier -- the modified version of the file would need to replace the older one in the tarball before being deleted. I'm not sure how straightforward that would be to add, but the tarball would need to be gunzipped first, since updates only work on an uncompressed archive. Tombo could then extract files, update them, and fold them back into the tar via the Python equivalent of: tar -uf fast5dir.tar file.fast5.
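
A minimal sketch of that loop with Python's tarfile module -- assuming an uncompressed archive and a hypothetical process_fast5() callback standing in for whatever Tombo step edits the file in place:

    import os
    import tarfile
    import tempfile

    def update_fast5s_in_tar(tar_path, process_fast5):
        # List the reads up front so we can re-open the tar per member.
        with tarfile.open(tar_path, 'r') as tar:
            names = [m.name for m in tar.getmembers()
                     if m.name.endswith('.fast5')]
        for name in names:
            with tempfile.TemporaryDirectory() as tmp_dir:
                # Pull out a single read so we never hold more than one
                # extracted file on disk at a time.
                with tarfile.open(tar_path, 'r') as tar:
                    tar.extract(name, path=tmp_dir)
                local_path = os.path.join(tmp_dir, name)
                process_fast5(local_path)  # hypothetical in-place edit
                # Python's tarfile only supports append ('a') on
                # uncompressed archives -- hence the need to gunzip first.
                # Like `tar -uf`, this appends the newer copy; later
                # entries win on extraction.
                with tarfile.open(tar_path, 'a') as tar:
                    tar.add(local_path, arcname=name)

Like tar -uf, append mode leaves the stale copy in the archive, so the tar grows until it is rewritten -- a trade-off against keeping the live file count down.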

There are obvious workarounds -- such as extracting the files from the tarball, removing the tarball, doing the Tombo steps, then re-tarring everything. The problem is that this can leave you so close to the file-number limit that it becomes difficult to do anything else -- even anything else with Tombo (e.g. on the contents of a different tarball).

Anyway, I am curious to know if there are any plans for Tombo to deal with tarballs or something similar.

Thanks for any thoughts -- and I completely understand if this is not a priority. I am just dealing with a little bit of this at the moment and thought I'd raise it as an issue that you or others might care about.

John

JohnUrban • Mar 28 '18 16:03

Yes, I have put some thought into this, but given the issues you mention (especially writing back, and thus re-tarring/re-gzipping), it is certainly not a priority at the moment. The long-term goal here might be to do away with adding information back to the FAST5 files and instead store all re-squiggle data (including base event levels) in some sort of index database. That is quite a large change to the structure of Tombo, so I don't think it will happen right now, but I will leave this as an open issue. I do think this is an important point, and it should remain a long-term goal for Tombo.

marcus1487 • Mar 28 '18 17:03

Hi,

I think I am running into a similar issue. I am trying to run tombo test_significance, and it runs fine unless I add the --per-read-statistics-basename option to the command. When I add it, I get:

    Traceback (most recent call last):
      File "/miniconda2/envs/tombo/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
        send_bytes(obj)
      File "/miniconda2/envs/tombo/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/miniconda2/envs/tombo/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
        header = struct.pack("!i", n)
    struct.error: 'i' format requires -2147483648 <= number <= 2147483647
    Tabulating all stats.

I am trying to analyze about 400k reads. I have 256 GB of memory, so I don't think this is a hardware bottleneck. Is there any known workaround for this issue?

Thank you.

Tombo is truly an amazing tool!

alex-sal • Mar 29 '18 23:03

I'm glad you are finding Tombo useful!

A bit of googling seems to indicate that Python's multiprocessing module has a limit on the size of an object passed between processes. So it isn't really a system memory issue so much as an internal multiprocessing limit -- in fact, the traceback shows the message length being packed with struct.pack("!i", n), i.e. as a signed 32-bit integer, whose maximum is exactly 2147483647.
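
A quick check confirms it is the struct format, not anything Tombo-specific:

    import struct

    # 2147483647 == 2**31 - 1, the largest signed 32-bit integer.
    # multiprocessing packs each pickled message's byte length with the
    # "!i" format, so any single payload over ~2 GB overflows it.
    struct.pack("!i", 2**31 - 1)   # fits
    try:
        struct.pack("!i", 2**31)   # one past the limit
    except struct.error as err:
        print(err)  # 'i' format requires -2147483648 <= number <= 2147483647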

The fix within Tombo is to set the --multiprocess-region-size option to a smaller value. This shrinks the amount of data returned by any single process and should resolve the error. Note that the trigger is not so much how many reads you have in total as how much coverage you have over any individual processed region, which can be very high in some use cases. Hopefully this resolves your issue.
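
For reference, an invocation along these lines should apply the fix (the paths, basenames, and region size here are placeholders, not values from this thread):

    tombo test_significance --fast5-basedirs fast5s/ --statistics-file-basename sample --per-read-statistics-basename sample.per_read --multiprocess-region-size 1000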

marcus1487 • Apr 02 '18 14:04

Hi, I have the same issue with the file limits. We have > 2,700,000 reads in a PromethION run for one sample.

In order to use tombo preprocess annotate_raw_with_fastq and tombo resquiggle, I need to convert the multi-read fast5s to single-read fast5s. This step is causing problems, as the number of single-read fast5 files far exceeds the file limit on my system, and our cluster admin does not like the idea of raising the limit for me. So may I ask whether it is possible to feed multi-read fast5s to Tombo directly? I tried it before and it failed, but with your expertise, could you help me solve this problem?
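
(For reference, the conversion step is ont-fast5-api's multi_to_single_fast5 -- something like multi_to_single_fast5 --input_path multi_fast5s/ --save_path single_fast5s/ --threads 8, with placeholder paths -- which writes one file per read, hence the explosion in file count.)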

Thanks a lot in advance!

zmz1988 • Jun 28 '23 10:06

Tombo has been deprecated, with the lack of multi-read FAST5 support among the many reasons. Please see the latest Remora raw signal analysis workflows, which can hopefully replace the Tombo workflows, or raise issues in Remora for features missing from Tombo.

marcus1487 • Jun 28 '23 12:06

Ok, thanks for replying to me!

zmz1988 • Jun 28 '23 13:06