Q: `pod5 subset <..> --threads` takes a lot of resources
I just started using `pod5 subset` to regroup my POD5 data per channel, to optimize `dorado duplex` basecalling speed.
The default setting for `--threads` is 4; I was brave and set it to 8 .. well, ..
System: Linux, 128 cores, 1TB RAM
Load before starting pod5 was around 34 ..
Shortly after starting something like:
```bash
pod5 subset \
    --threads 8 \
    --force-overwrite \
    --recursive \
    --summary $rc_file \
    --columns channel \
    --output $POD5_TMPDIR \
    $RAW_DATA_DIR
```
with a small (7 GB) P2 dataset, the system load went over 200, making the system a bit sluggish. Heavy I/O, I guess ..
`top` showed the 8 python processes, each at 700 to 1200% CPU and in varying states (R, S, D) ...
```
top - 21:43:28 up 274 days, 10:38, 29 users,  load average: 214.81, 106.81, 77.11
<...>
  PID USER  PR  NI  VIRT   RES     SHR    S  %CPU   %MEM  TIME+     COMMAND
19971 USER  20   0  22.2g  349288  51752  R  1210   0.0   36:49.68  python3
19972 USER  20   0  21.6g  288420  51156  S  1041   0.0   37:35.33  python3
19967 USER  20   0  21.7g  377372  50984  D  989.2  0.0   36:09.18  python3
19969 USER  20   0  21.7g  277636  48848  R  972.3  0.0   35:51.98  python3
19974 USER  20   0  21.7g  380280  54068  S  949.0  0.0   38:09.10  python3
19973 USER  20   0  21.7g  362636  48292  R  943.6  0.0   35:44.70  python3
19968 USER  20   0  21.6g  286052  48144  S  875.5  0.0   37:15.61  python3
19970 USER  20   0  21.7g  270932  47140  D  843.0  0.0   36:15.94  python3
```
This is far more than I would expect with `--threads 8` ...
Is there a clean way to control the number of CPUs used by `pod5 subset`, and thus the I/O throughput, without resorting to something like `taskset`?
This small dataset took ~7 minutes to finish; the larger datasets are 100x to 200x this size. So I wonder what is considered "best practice"? On shared servers this is quite a big issue.
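For now I am experimenting with throttling `pod5 subset` from the outside. A minimal sketch, assuming systemd is available; the `CPUQuota=800%` value is just a placeholder:

```bash
# Run pod5 subset inside a transient systemd scope so the kernel caps its
# total CPU time at roughly 8 cores' worth, no matter how many threads
# the process actually spawns. CPUQuota=800% is a placeholder value.
systemd-run --user --scope -p CPUQuota=800% \
    pod5 subset \
        --threads 8 \
        --force-overwrite \
        --recursive \
        --summary "$rc_file" \
        --columns channel \
        --output "$POD5_TMPDIR" \
        "$RAW_DATA_DIR"

# Softer alternative (lowers CPU and I/O scheduling priority so other
# users are not starved, but does not cap the thread count itself):
#   nice -n 19 ionice -c 2 -n 7 pod5 subset <same arguments as above>
```

This only caps the damage from the outside, though; it does not explain why 8 requested threads turn into ~80 cores of work.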
Instead of writing a few thousand per-channel POD5 files, wouldn't it be more convenient to write one, or a few, large per-channel-sorted POD5 files?
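One way to approximate that with the current CLI might be to bucket channels in the summary and subset on a derived column, assuming `--columns` accepts any column present in the summary file. A rough sketch (the `channel_group` name and the bucket size of 100 are my own placeholders):

```bash
# Append a coarse "channel_group" column (channels 1-100 -> group 0, etc.)
# so pod5 subset writes a few dozen files instead of a few thousand.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print $0, "channel_group"; next }
     { print $0, int(($2 - 1) / 100) }' "$rc_file" > summary_grouped.tsv

pod5 subset \
    --threads 4 \
    --force-overwrite \
    --recursive \
    --summary summary_grouped.tsv \
    --columns channel_group \
    --output "$POD5_TMPDIR" \
    "$RAW_DATA_DIR"
```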
Any ideas/comments/remarks are welcome :-)
Now using a large P2 dataset, but leaving `--threads` at the default (4).
This runs out of memory on the same machine with 1TB RAM (~550G free); output reformatted for better readability:
```
### [2023-12-02 07:27:45] START: Merging POD5 files by channel..
Parsed 51501778 targets
memory allocation of 2883584000 bytes failed
(core dumped) pod5 subset
    --threads 4
    --force-overwrite
    --recursive
    --summary $rc_file
    --columns channel
    --output $POD5_TMPDIR
    $RAW_DATA_DIR
```
RAW_DATA_DIR and POD5_TMPDIR are on different storage. $rc_file is 2 GB in size:
```
$ head pod5_summary_per-channel.tsv
read_id                                 channel
7636a648-a348-4b3a-9925-39587c4dfbbd    2727
0355921b-3da6-45a7-af50-45b9a7e1d93f    1940
bf818b42-7108-4f63-9115-0df5a66c9db6    1572
<...>
```
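For reference, a summary like this can be produced with `pod5 view`; a sketch, assuming its `--include` field selection works as documented:

```bash
# Dump only read_id and channel for every read, producing a tab-separated
# table with a header row (the format pod5 subset --summary expects).
pod5 view --recursive \
    --include "read_id, channel" \
    --output pod5_summary_per-channel.tsv \
    "$RAW_DATA_DIR"
```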
```
$ pod5 -v
Pod5 version: 0.3.2
```
Is there something very obvious which I am missing here?
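As a stopgap I could probably split the summary into channel ranges and run `pod5 subset` once per chunk, trading repeated input scans for bounded memory. A rough sketch (the chunk size of 500 channels and the ~3000-channel upper bound are arbitrary placeholders):

```bash
# Process 500 channels at a time, so each pod5 subset run only has to
# parse a fraction of the ~51.5M targets.
for lo in $(seq 1 500 3000); do
    hi=$((lo + 499))
    awk -v lo="$lo" -v hi="$hi" \
        'BEGIN { FS = "\t" } NR == 1 || ($2 >= lo && $2 <= hi)' \
        "$rc_file" > "chunk_${lo}_${hi}.tsv"
    pod5 subset \
        --threads 4 \
        --force-overwrite \
        --recursive \
        --summary "chunk_${lo}_${hi}.tsv" \
        --columns channel \
        --output "$POD5_TMPDIR" \
        "$RAW_DATA_DIR"
done
```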
Hi @sklages, we are in the process of updating `pod5 subset` to use significantly fewer resources.
Thanks for raising this issue. We'll let you know when we push these changes up.