
pod5 subset cannot shard a large aggregate pod5 file

Open · billytcl opened this issue 2 years ago • 6 comments

Using Python 3.7 and pod5 0.2.4, I'm trying to use pod5 subset to break up a large pod5 file into smaller ones, but it crashes with a strange error.

I used pod5 inspect reads to generate a reads table, planning to split on the "channel" column:

nohup pod5 inspect reads converted.pod5 > inspect_reads.txt &

First 10k lines:

read_id filename        read_number     channel mux     end_reason      start_time      start_sample    duration        num_samples     minknow_events  sample_rate     median_before   predicted_scaling_scale predicted_scaling_shift   tracked_scaling_scale   tracked_scaling_shift   num_reads_since_mux_change      time_since_mux_change   run_id  sample_id       experiment_id   flow_cell_id    pore_type
001e35eb-1c55-4c7b-8886-dcdcb6dca15f    converted.pod5  74306   148     4       signal_positive 22092.86475000  88371459        0.64550000      2582    144     4000    205.11199951    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00356992-b31c-49a9-bb54-b5f04e8feccb    converted.pod5  67712   850     4       signal_positive 22086.55175000  88346207        0.71600000      2864    142     4000    208.46253967    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00399263-1258-42df-be8e-623435a5e3b0    converted.pod5  68784   1138    3       signal_positive 22092.13000000  88368520        0.93625000      3745    195     4000    205.83134460    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0046e652-310e-42e8-a9eb-a97afe61cd6c    converted.pod5  71090   2653    3       signal_positive 22093.76100000  88375044        0.94250000      3770    201     4000    204.88726807    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
004f3d41-1063-43d5-a2fe-ab70094d784c    converted.pod5  74144   1178    3       signal_positive 22089.63250000  88358530        0.90700000      3628    194     4000    208.21510315    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00625954-411f-4c18-ad6d-f7d0b5234f84    converted.pod5  65143   1218    1       signal_positive 22094.15175000  88376607        0.78825000      3153    163     4000    213.10104370    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
007ff853-7128-4f74-bb12-88a31ac23205    converted.pod5  46458   2338    1       signal_positive 22088.29975000  88353199        0.67450000      2698    152     4000    210.64157104    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0080d4a1-8abd-4a06-9a9a-e6b8d856122a    converted.pod5  74703   1140    4       signal_positive 22090.70975000  88362839        1.09300000      4372    258     4000    206.32232666    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0089e982-2261-4a08-9945-07b25a7f7214    converted.pod5  67244   56      1       signal_positive 22091.08575000  88364343        1.03300000      4132    196     4000    211.88562012    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set

Then I run subset:

nohup pod5 subset pod5/*.pod5 -o pod5_subset/ -f -r -s pod5/inspect_reads.head.txt -c channel &

It crashes with:

Parsed 9999 targets
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

POD5 has encountered an error: ''

For detailed information set POD5_DEBUG=1
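
For reference, the limit itself appears to come from pickle rather than pod5: protocols 3 and below store bytes lengths in 32-bit fields, and Python 3.7's multiprocessing serializes queue items with the default protocol, which is 3 on that version. A minimal standalone sketch of the limit (not pod5 code):

import pickle

# Standalone sketch, not pod5 code. Pickle protocols <= 3 encode
# bytes lengths in 32-bit fields, and Python 3.7's multiprocessing
# pickles queue items with the default protocol (3 on that version).
big = bytes(4 * 1024**3 + 1)  # just over 4 GiB (needs that much RAM)

try:
    pickle.dumps(big, protocol=3)
except OverflowError as exc:
    print(exc)  # cannot serialize a bytes object larger than 4 GiB

# Protocol 4 (Python 3.4+) switched to 64-bit sizes, so it copes.
assert len(pickle.dumps(big, protocol=4)) > 4 * 1024**3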

billytcl avatar Sep 17 '23 18:09 billytcl

That's a new one - thanks @billytcl,

We will take a look and try to work out what's going on internally...

  • George

0x55555555 avatar Sep 18 '23 07:09 0x55555555

Thanks! I have an immediate need for this so fingers crossed that it's not a terrible bug.

billytcl avatar Sep 18 '23 07:09 billytcl

A short-term workaround might be to work in smaller batches, if that helps?

If you are able to rerun with POD5_DEBUG=1 set during execution, it may give us more information to debug.
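
For example, the same subset command from above rerun with the variable set:

POD5_DEBUG=1 pod5 subset pod5/*.pod5 -o pod5_subset/ -f -r -s pod5/inspect_reads.head.txt -c channel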

How large was the input dataset (and approximate read lengths) - so I can ensure we have an equivalent internal dataset to test on?

Thanks,

  • George

0x55555555 avatar Sep 18 '23 07:09 0x55555555

Unfortunately I can't work in smaller batches, as the pod5 was generated as an aggregate of an entire run's fast5s (using pod5 convert fast5 without the one-to-one option). Ironically, I was trying to use subset to break it into smaller batches grouped by channel! I am re-converting all my fast5s using the one-to-one option, but that takes a decent amount of time to go through all of our runs.

The input dataset is ~30-90k fast5 files, so that could be anywhere from 500 GB to 2 TB. Read lengths should be short - ~170 bp on average.

billytcl avatar Sep 18 '23 07:09 billytcl

Ok - no worries.

Could you try using view rather than inspect? This will cut down the size of the input csv passed to subset:

pod5 view --include "read_id, channel" converted.pod5

Should produce a significantly smaller file for subset to process.
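
If batching is still needed after that, an untested sketch like the following (assuming the table is tab-separated, as in the output pasted above) could split it into one piece per channel, each small enough to feed to subset on its own:

import csv
from collections import defaultdict

# Untested sketch: split the view/inspect output into one smaller
# table per channel, keeping the header row, so each piece can be
# passed to "pod5 subset -s <piece> -c channel" as a separate run.
rows_by_channel = defaultdict(list)
with open("inspect_reads.txt", newline="") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames
    for row in reader:
        rows_by_channel[row["channel"]].append(row)

for channel, rows in rows_by_channel.items():
    with open(f"reads_channel_{channel}.txt", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)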

0x55555555 avatar Sep 18 '23 08:09 0x55555555

Going to try pod5 view with the reduced columns now. In the meantime, here are the logs with debug mode:

2023-09-18--00-55-28-p-3088300-pod5.log
2023-09-18--00-55-26-main-pod5.log

Alongside those is the error captured by nohup:

Parsed 9999 targets
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/bin/pod5", line 8, in <module>
    sys.exit(main())
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/main.py", line 60, in main
    return run_tool(parser)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 41, in run_tool
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 38, in run_tool
    return tool_func(**kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 564, in run
    return subset_pod5(**kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 647, in subset_pod5
    force_overwrite=force_overwrite,
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 585, in subset_pod5s_with_mapping
    sources_df = parse_sources(inputs, duplicate_ok, threads)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 321, in parse_sources
    items.append(parsed_sources.get(timeout=60))
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

billytcl avatar Sep 18 '23 08:09 billytcl