Q: `pod5 subset <..> --threads` takes a lot of resources
I just started using `pod5 subset` to regroup my POD5 data per channel, to optimize `dorado duplex` basecalling speed.
The default setting for `--threads` is 4; I was brave and set it to 8 .. well, ..
System: Linux, 128 cores, 1TB RAM
Load before starting pod5 was around 34 ..
Shortly after starting something like:
```bash
pod5 subset \
    --threads 8 \
    --force-overwrite \
    --recursive \
    --summary $rc_file \
    --columns channel \
    --output $POD5_TMPDIR \
    $RAW_DATA_DIR
```
with a small (7 GB) P2 dataset, the system load went over 200, making the system a bit sluggish. Heavy I/O, I guess ..
`top` showed the 8 python processes, each at 700 to 1200% CPU and in varying states (R, S, D) ...
```
top - 21:43:28 up 274 days, 10:38, 29 users,  load average: 214.81, 106.81, 77.11
<...>
  PID USER  PR  NI  VIRT   RES     SHR    S  %CPU   %MEM  TIME+     COMMAND
19971 USER  20   0  22.2g  349288  51752  R  1210   0.0   36:49.68  python3
19972 USER  20   0  21.6g  288420  51156  S  1041   0.0   37:35.33  python3
19967 USER  20   0  21.7g  377372  50984  D  989.2  0.0   36:09.18  python3
19969 USER  20   0  21.7g  277636  48848  R  972.3  0.0   35:51.98  python3
19974 USER  20   0  21.7g  380280  54068  S  949.0  0.0   38:09.10  python3
19973 USER  20   0  21.7g  362636  48292  R  943.6  0.0   35:44.70  python3
19968 USER  20   0  21.6g  286052  48144  S  875.5  0.0   37:15.61  python3
19970 USER  20   0  21.7g  270932  47140  D  843.0  0.0   36:15.94  python3
```
This is far more than I would expect with `--threads 8` ...
Is there a clean way to control the number of CPUs used by `pod5 subset`, and thus the I/O throughput, without resorting to something like `taskset`?
This small dataset took ~7 minutes to finish; the larger datasets are 100x to 200x this size. So I wonder what is considered "best practice"? On shared servers this is quite a big issue.
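For now I am experimenting with throttling `pod5 subset` from the outside. A minimal sketch, assuming systemd is available; the `CPUQuota=800%` value is just a placeholder:

```bash
# Run pod5 subset inside a transient systemd scope so the kernel caps its
# total CPU time at roughly 8 cores' worth, no matter how many threads
# the process actually spawns. CPUQuota=800% is a placeholder value.
systemd-run --user --scope -p CPUQuota=800% \
    pod5 subset \
        --threads 8 \
        --force-overwrite \
        --recursive \
        --summary "$rc_file" \
        --columns channel \
        --output "$POD5_TMPDIR" \
        "$RAW_DATA_DIR"

# Softer alternative (lowers CPU and I/O scheduling priority so other
# users are not starved, but does not cap the thread count itself):
#   nice -n 19 ionice -c 2 -n 7 pod5 subset <same arguments as above>
```

This only caps the damage from the outside, though; it does not explain why 8 requested threads turn into ~80 cores of work.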
Instead of writing a few thousand per-channel POD5 files, wouldn't it be more convenient to write one, or a few, large per-channel-sorted POD5 files?
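One way to approximate that with the current CLI might be to bucket channels in the summary and subset on a derived column, assuming `--columns` accepts any column present in the summary file. A rough sketch (the `channel_group` name and the bucket size of 100 are my own placeholders):

```bash
# Append a coarse "channel_group" column (channels 1-100 -> group 0, etc.)
# so pod5 subset writes a few dozen files instead of a few thousand.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print $0, "channel_group"; next }
     { print $0, int(($2 - 1) / 100) }' "$rc_file" > summary_grouped.tsv

pod5 subset \
    --threads 4 \
    --force-overwrite \
    --recursive \
    --summary summary_grouped.tsv \
    --columns channel_group \
    --output "$POD5_TMPDIR" \
    "$RAW_DATA_DIR"
```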
Any ideas/comments/remarks are welcome :-)
Now using a large P2 dataset, but leaving `--threads` at the default (4).
This runs out of memory on the same machine with 1TB RAM (~550G free); output reformatted for better readability:
```
### [2023-12-02 07:27:45] START: Merging POD5 files by channel..
Parsed 51501778 targets
memory allocation of 2883584000 bytes failed
(core dumped) pod5 subset
    --threads 4
    --force-overwrite
    --recursive
    --summary $rc_file
    --columns channel
    --output $POD5_TMPDIR
    $RAW_DATA_DIR
```
RAW_DATA_DIR and POD5_TMPDIR are on different storage. $rc_file is 2 GB in size:
```
$ head pod5_summary_per-channel.tsv
read_id                                 channel
7636a648-a348-4b3a-9925-39587c4dfbbd    2727
0355921b-3da6-45a7-af50-45b9a7e1d93f    1940
bf818b42-7108-4f63-9115-0df5a66c9db6    1572
<...>
```
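For reference, a summary like this can be produced with `pod5 view`; a sketch, assuming its `--include` field selection works as documented:

```bash
# Dump only read_id and channel for every read, producing a tab-separated
# table with a header row (the format pod5 subset --summary expects).
pod5 view --recursive \
    --include "read_id, channel" \
    --output pod5_summary_per-channel.tsv \
    "$RAW_DATA_DIR"
```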
```
$ pod5 -v
Pod5 version: 0.3.2
```
Is there something very obvious which I am missing here?
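As a stopgap I could probably split the summary into channel ranges and run `pod5 subset` once per chunk, trading repeated input scans for bounded memory. A rough sketch (the chunk size of 500 channels and the ~3000-channel upper bound are arbitrary placeholders):

```bash
# Process 500 channels at a time, so each pod5 subset run only has to
# parse a fraction of the ~51.5M targets.
for lo in $(seq 1 500 3000); do
    hi=$((lo + 499))
    awk -v lo="$lo" -v hi="$hi" \
        'BEGIN { FS = "\t" } NR == 1 || ($2 >= lo && $2 <= hi)' \
        "$rc_file" > "chunk_${lo}_${hi}.tsv"
    pod5 subset \
        --threads 4 \
        --force-overwrite \
        --recursive \
        --summary "chunk_${lo}_${hi}.tsv" \
        --columns channel \
        --output "$POD5_TMPDIR" \
        "$RAW_DATA_DIR"
done
```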
Hi @sklages, we are in the process of updating `pod5 subset` to use significantly fewer resources.
Thanks for raising this issue. We'll let you know when we push these changes up.