Empty path passed and invalid Read ID with pod5 view and pod5 subset
Dear Nanopore-Team,
Thank you for sharing your work with us. My problem concerns a pipeline that chains several commands, as recommended in the Duplex documentation.
I am working with long-read sequencing data, and the goal is to re-basecall the generated pod5 files (1-2 TB folder size) in duplex mode to achieve higher read quality. I am using the latest SQK-LSK114 kit. I have had great difficulty using the provided pod5 tools on data of this size; even the “pod5 view” command took longer than expected.
Problems
My problems can be summarized as follows:
- The pod5 commands do not terminate but remain as zombie processes, blocking resources on the cluster for an entire night or longer.
- Despite the almost hard-coded file paths, pod5 subset raises a “FileNotFoundError”; the path reported in the error ends in an empty or garbled file name.
- When checking the performance of pod5 with htop, I saw that 90% of the processes had 0% CPU usage, which prompted me to reduce the threads to only 1.
- Even with --threads 1, the utilization was similarly low (over 12 hours runtime).
- To reduce the number of processes generated by Python multiprocessing, I tried thread counts from 32 down to 1, but saw no noticeable improvement in speed or script success.
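To tell real zombies (state Z) apart from processes that are merely sleeping or waiting on I/O, the process states can be tallied. The following is a minimal sketch, not part of my pipeline: the function name summarise_states is hypothetical, and it assumes input lines in the form produced by `ps -eo stat=,comm=` (state first, command name second).

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally the first state letter (R, S, D, Z, ...) of every
# process whose command name matches the given regular expression.
# Expects lines of the form "<stat> <comm>" on stdin.
summarise_states() {
    local pattern="$1"
    awk -v pat="$pattern" '
        $2 ~ pat { count[substr($1, 1, 1)]++ }    # keep only the leading state letter
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Usage on a live system (assumption: the worker processes may show up under
# the Python interpreter name rather than "pod5"):
#   ps -eo stat=,comm= | summarise_states 'pod5|python'
```

A count under Z would confirm zombies; a large count under D or S would instead point to workers blocked on storage or on the task queue.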
Questions:
- What is recommended when working with large sequencing data for tools such as pod5 in a Dorado Duplex workflow?
- How can I reduce the preprocessing time?
- Is there a point beyond which preprocessing takes more time than running Dorado Duplex on the unsplit pod5 directories?
- How can I avoid the path errors and invalid read ID errors, and what could be causing them?
Code
The bash pipeline commands are as follows:
# generating the pod5 view summary file for subsetting
pod5 view "${BASE_PATH}/${SAMPLE}/${MIDFOLDER}/pod5/" --include "read_id, channel" --output "${BASE_PATH}/${SAMPLE}/${SAMPLE}_pod5View_summary.tsv"
# subsetting the pod5 dir with the view-summary
pod5 subset "${BASE_PATH}/${SAMPLE}/${MIDFOLDER}/pod5/" --summary "${BASE_PATH}/${SAMPLE}/${SAMPLE}_pod5View_summary.tsv" --columns channel --recursive --threads 1 --force-overwrite --output "${DEST_PATH}/${SAMPLE}"
#running dorado duplex
/.../dorado/dorado-1.1.1-linux-x64/bin/dorado duplex sup "${TMP_DIR}/" --min-qscore 8 --verbose --device 'cuda:all' --threads 32 --models-directory "/path_to_/dorado/dorado_MODELS" > "${TMP_DIR}/${POD5_NAME}_basecalled_${SLURM_ARRAY_TASK_ID}.duplex.bam"
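Because some of the runs fail with "Invalid read id passed", a sanity check on the summary file before subsetting might help narrow things down. This is a minimal sketch under stated assumptions: the function name count_invalid_read_ids is hypothetical, the summary is assumed to be tab-separated with one header line, and read_id is assumed to be the first column in UUID format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count data rows whose first column is not a
# UUID-formatted read id (assumes a tab-separated file with one header line).
count_invalid_read_ids() {
    local summary="$1"
    # Skip the header, keep column 1, and count lines that do NOT match a UUID.
    # grep -c exits non-zero when the count is 0, hence the trailing "|| true".
    tail -n +2 "$summary" \
        | cut -f1 \
        | grep -Evc '^[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$' \
        || true
}

# Usage:
#   count_invalid_read_ids "${BASE_PATH}/${SAMPLE}/${SAMPLE}_pod5View_summary.tsv"
```

A non-zero count would suggest that the summary file itself is truncated or corrupted (for example, if pod5 view was interrupted), rather than the pod5 files.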
System Specs
- Pod5 version: 0.3.28
- 4 - 64 CPUS
- 200GB RAM
- linux: "Rocky Linux" on a HPC
Error codes
Generating pod5 view summary file for subsetting ...
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/14CGUP-2
/path/to/storage/Promethion_mnt_storage/14CGUP-2/20250717_1348_3G_PAO33929_74d22361
pod5 subset /path/to/storage/Promethion_mnt_storage/14CGUP-2/20250717_1348_3G_PAO33929_74d22361/pod5/ --summary /path/to/storage/Promethion_mnt_storage/14CGUP-2/20250717_1348_3G_PAO33929_74d22361/14CGUP-2_pod5View_summary.tsv --columns channel --output /path/to/storage/Project_pod5_samples_sorted/14CGUP-2
Parsed 151984134 targets
Calculated 151984134 transfers
channel-1094.pod5: 83%|########3 | 29698/35681 [01:10<00:14, 418.73Reads/s]
channel-1094.pod5: 88%|########7 | 31398/35681 [01:15<00:10, 418.11Reads/s]
channel-1094.pod5: 93%|#########3| 33199/35681 [01:19<00:05, 417.99Reads/s]
channel-1094.pod5: 100%|##########| 35681/35681 [01:25<00:00, 418.93Reads/s]
Subsetting: 4%|4 | 115/2670 [4:31:40<100:35:58, 141.75s/Files]
Process SpawnProcess-3:
Traceback (most recent call last):
File "/home/user/.conda/envs/bioinf/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/user/.conda/envs/bioinf/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 513, in process_subset_tasks
task = queue.work.get(timeout=60)
File "/home/user/.conda/envs/bioinf/lib/python3.9/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
Generating pod5 view summary file for subsetting ...
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/FR00002AD4
/path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9
pod5 subset /path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9/pod5/ --summary /path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9/FR00002AD4_pod5View_summary.tsv --columns channel --output /path/to/storage/Project_pod5_samples_sorted/FR00002AD4
Parsed 171430773 targets
Calculated 171430773 transfers
self._target(*self._args, **self._kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
subset_reads(target, sources, process, duplicate_ok)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 562, in subset_reads
with p5.Reader(Path(source)) as reader:
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 700, in __init__
) = self._open_arrow_table_handles(self._path)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 734, in _open_arrow_table_handles
raise FileNotFoundError(f"Failed to open pod5 file at: {path}")
FileNotFoundError: Failed to open pod5 file at: /path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9/@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/FR000345AD
/path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49
pod5 subset /path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49/pod5/ --summary /path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49/FR000345AD_pod5View_summary.tsv --columns channel --output /path/to/storage/Project_pod5_samples_sorted/FR000345AD
Parsed 132073268 targets
Calculated 132073268 transfers
self._target(*self._args, **self._kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
subset_reads(target, sources, process, duplicate_ok)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 562, in subset_reads
with p5.Reader(Path(source)) as reader:
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 700, in __init__
) = self._open_arrow_table_handles(self._path)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 734, in _open_arrow_table_handles
raise FileNotFoundError(f"Failed to open pod5 file at: {path}")
FileNotFoundError: Failed to open pod5 file at: /path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49/
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/L123
Outputpath temp: /tmp/pod5_subsetting/L123/
pod5 subset /tmp/pod5_subsetting/pod5 --summary /path/to/storage/Promethion_mnt_storage/L123/20250709_1010_3D_PAO33554_61511428/L123_pod5View_summary.tsv --columns channel --recursive --threads 1 --force-overwrite --output /tmp/pod5_subsetting/L123/
Parsed 137626741 targets
Calculated 137626741 transfers
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
subset_reads(target, sources, process, duplicate_ok)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 563, in subset_reads
repacker.add_selected_reads_to_output(output, reader, read_ids)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/repack.py", line 89, in add_selected_reads_to_output
successful_finds, per_batch_counts, all_batch_rows = reader._plan_traversal(
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 1091, in _plan_traversal
read_ids = pack_read_ids(read_ids, invalid_ok=missing_ok)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/api_utils.py", line 38, in pack_read_ids
raise RuntimeError("Invalid read id passed")
RuntimeError: Invalid read id passed
Outputpath: /path/to/storage/Project_pod5_samples_sorted/L126
Outputpath temp: /tmp/pod5_subsetting/L126/
pod5 subset /tmp/pod5_subsetting/pod5 --summary /path/to/storage/Promethion_mnt_storage/L126/20250715_1139_2H_PAO33830_8969d520/L126_pod5View_summary.tsv --columns channel --recursive --threads 1 --force-overwrite --output /tmp/pod5_subsetting/L126/
Parsed 126968283 targets
Calculated 126968283 transfers
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
subset_reads(target, sources, process, duplicate_ok)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
raise exc
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
ret = func(*args, **kwargs)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 563, in subset_reads
repacker.add_selected_reads_to_output(output, reader, read_ids)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/repack.py", line 89, in add_selected_reads_to_output
successful_finds, per_batch_counts, all_batch_rows = reader._plan_traversal(
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 1091, in _plan_traversal
read_ids = pack_read_ids(read_ids, invalid_ok=missing_ok)
File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/api_utils.py", line 38, in pack_read_ids
raise RuntimeError("Invalid read id passed")
RuntimeError: Invalid read id passed