pod5-file-format
memory error
Hi there,
I'm using the pod5 Python API to load signals and write them to another file, but I hit this error while iterating through the pod5 reader.
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/extract.py", line 88, in load_pod5_signals_and_save
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in signal
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in <listcomp>
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 380, in _find_signal_row_index
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 1070, in _get_signal_batch
File "pyarrow/ipc.pxi", line 974, in pyarrow.lib._RecordBatchFileReader.get_batch
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 590, in read
MemoryError
I allocated 256 GB of memory for this task. As far as I can tell, I only load the signal for one read at a time and never hold all the signals from the pod5 file in memory.
Here is the code:
pod5_reads = pod5_reader.reads(selection=read_ids)
for pod5_read in pod5_reads:
    read_id = str(pod5_read.read_id)
    if read_id in bam_info_map:
        read, cpgs = bam_info_map[read_id]
        if len(cpgs) > 0:
            cpg_chunks = generate_training_data(read, cpgs, pod5_read.signal)
            for chunk in cpg_chunks:
                chunk = np.asanyarray(chunk, dtype=object)
                np.save(f, chunk, allow_pickle=True)
Hi @aCoalBall ,
Can you show some more of the code - specifically how you're handling the pod5_reader?
Also, are you processing many files, or some very large files?
Hi HalfPhoton,
I just create the pod5_reader with:
pod5_reader = pod5.Reader(pod5_path) # pod5_path is the path as a str
I am processing a single 141 GB pod5 file.
By the way, it does not seem to be caused by running out of memory. I tried requesting different amounts of memory, but it always stops at the same position (around the 300,000th read).
Interesting.
@aCoalBall can you confirm it still crashes if you don't do your downstream processing?
I wouldn't expect the code above to retain the signals in memory unless you hold them in your training data.
So, this should work:
for pod5_read in pod5_reads:
    read_id = str(pod5_read.read_id)
    if read_id in bam_info_map:
        read, cpgs = bam_info_map[read_id]
        if len(cpgs) > 0:
            pass
Can you confirm? Are you able to provide the complete code otherwise and we can investigate further.
- George
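One way to check whether memory is actually being retained during iteration is to log the process's peak resident set size every few thousand reads. The following is only a minimal sketch: the helper names `peak_rss_mib` and `iterate_with_memory_log` are hypothetical, it relies on the Unix-only `resource` module, and it measures physical memory rather than virtual address space.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def iterate_with_memory_log(items, work=None, every=10000, log=print):
    """Iterate over items, optionally call work(item), and log peak RSS.

    Returns the number of items processed.
    """
    count = 0
    for count, item in enumerate(items, start=1):
        if work is not None:
            work(item)
        if count % every == 0:
            log(f"at item {count}: peak RSS {peak_rss_mib():.1f} MiB")
    return count


# Hypothetical usage with the reader from this thread:
# iterate_with_memory_log(pod5_reader.reads(), work=lambda r: r.signal)
```

If the logged peak stays roughly flat up to the point of the MemoryError, that would suggest the failure is not caused by signals accumulating in physical memory, although it would not by itself identify the cause.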
@jorj1988 Hi George,
I tried what you suggested, and the following code runs fine.
for pod5_read in pod5_reads:
    read_id = str(pod5_read.read_id)
    if read_id in bam_info_map:
        read, cpgs = bam_info_map[read_id]
        if len(cpgs) > 0:
            pass
However, as soon as I access pod5_read.signal, for example signal = pod5_read.signal or x = type(pod5_read.signal), it raises a MemoryError.
for pod5_read in pod5_reads:
    read_id = str(pod5_read.read_id)
    if read_id in bam_info_map:
        read, cpgs = bam_info_map[read_id]
        if len(cpgs) > 0:
            signal = pod5_read.signal
Traceback (most recent call last):
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/prepare_data.py", line 67, in <module>
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/prepare_data.py", line 52, in main
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/extract.py", line 95, in load_pod5_signals_and_save
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in signal
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in <listcomp>
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 380, in _find_signal_row_index
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 1070, in _get_signal_batch
File "pyarrow/ipc.pxi", line 974, in pyarrow.lib._RecordBatchFileReader.get_batch
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 590, in read
MemoryError
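Because the crash is reported at roughly the same position on every run, it may help to record exactly which read fails. Below is a minimal sketch, assuming only that each read exposes `read_id` and `signal` as in the snippets above; the helper name `find_first_failing_read` is hypothetical and not part of the pod5 API.

```python
def find_first_failing_read(reads):
    """Return (index, read_id) of the first read whose signal raises MemoryError.

    Returns None if every signal loads. `reads` is any iterable of objects
    with `read_id` and `signal` attributes, e.g. pod5_reader.reads().
    """
    for index, read in enumerate(reads):
        try:
            signal = read.signal  # attribute access triggers the signal load
        except MemoryError:
            return index, str(read.read_id)
        del signal  # drop the reference before moving to the next read
    return None


# Hypothetical usage with the reader from this thread:
# print(find_first_failing_read(pod5_reader.reads()))
```

Knowing the zero-based index and read ID of the failing read would make it possible to test that single read in isolation, for example by passing its ID to `reads(selection=...)`.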
ok thanks.
What is your environment like - what sort of OS, VMem, Physical Memory available etc?
Thanks,
- George
Hi @jorj1988
Here is some basic info about the system.
And can you confirm if you have a virtual memory limit on the system?
eg:
george@host:~$ ulimit -v
unlimited
@jorj1988
Yes, there is a virtual memory limit
[coalball@gc066 ~]$ ulimit -v
134217728
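For reference, bash reports `ulimit -v` in units of 1024 bytes, so the value above corresponds to 128 GiB of virtual address space, which is smaller than the 141 GB file. The sketch below, assuming a bash shell and a Unix system, converts the value and reads the same limit from inside Python via the standard `resource` module; the helper names are hypothetical. Whether this limit actually explains the error is not established in this thread.

```python
import os
import resource


def ulimit_v_to_bytes(kib):
    """Convert a bash `ulimit -v` value (units of 1024 bytes) to bytes."""
    return kib * 1024


def address_space_limit_bytes():
    """Soft RLIMIT_AS of this process in bytes, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


def file_exceeds_limit(path):
    """True if the file is larger than the current address space limit."""
    limit = address_space_limit_bytes()
    return limit is not None and os.path.getsize(path) > limit


limit = ulimit_v_to_bytes(134217728)
print(limit, limit / 1024**3)  # 137438953472 128.0
```

Calling `file_exceeds_limit(pod5_path)` on the affected system would show whether the whole file could fit in the address space if it were mapped in one piece.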
Hi @aCoalBall ,
Would you be able to test the following code to ascertain whether this is a virtual memory issue?
pod5_reader = pod5.Reader(pod5_path)
pod5_reader._signal_handle._reader = None
pod5_reader._signal_handle._reader = pod5_reader._signal_handle._open_without_mmap()
for pod5_read in pod5_reads:
    # remaining code here...
Hi @HalfPhoton , I tried but the error is still there...
Hi @aCoalBall,
I've been running the below on a system similar to yours and not seen a crash yet... can you confirm this does crash for you?
My input file is 1.4TB, 64GB physical memory, with a virtual memory limit set to the same as yours (134217728).
import pod5
import sys
print("Open file")
pod5_reader = pod5.Reader(sys.argv[1])
print("Opened file")
pod5_reads = pod5_reader.reads()
for i, pod5_read in enumerate(pod5_reads):
    signal = pod5_read.signal
    if i % 10000 == 0:
        print(f"at read {i}")
I have taken out several bits of your script to put this example together... maybe we need to add some of them back to make it crash?
Thanks,
- George
Hi @jorj1988 ,
Actually, even the simplest iteration causes the error:
import pod5
pod5_path = '/home/coalball/projects/pod5/output.pod5'
pod5_reader = pod5.Reader(pod5_path)
pod5_reader._signal_handle._reader = None
pod5_reader._signal_handle._reader = pod5_reader._signal_handle._open_without_mmap()
for read in pod5_reader.reads():
    read.signal
I'm running this task on an HPC cluster, but I don't know how it is actually configured. I'm trying to switch to a different compute platform now.