Empty path passed and invalid Read ID with pod5 view and pod5 subset

Open leonard-creator opened this issue 1 month ago • 0 comments

Dear Nanopore-Team,

Thank you for sharing your work with us. My problem concerns a pipeline that chains multiple commands, as recommended in the Duplex documentation.

I am working with long-read sequencing data, and the goal is to re-basecall the generated pod5 files (1-2 TB folder size) in duplex mode to achieve higher read quality. I am using the latest SQK-LSK114 kit. I had great difficulty using the provided pod5 tools on data of this size; even the “pod5 view” command took longer than expected.

Problems

My problems can be summarized as follows:

  • The pod5 commands do not terminate but remain as zombie processes, blocking resources on the cluster for an entire night or longer.

  • Despite the almost hard-coded file paths, pod5 subset raises a “FileNotFoundError”; the path reported in the error either stops at the run directory or ends in a run of placeholder characters instead of a file name.

  • When checking the performance of pod5 with htop, I saw that 90% of the processes had 0% CPU usage, which prompted me to reduce the threads to only 1

  • Even with --threads 1, the utilization was similarly low (over 12 hours runtime).

  • To reduce the number of processes spawned by Python multiprocessing, I tried thread counts from 32 down to 1, but saw no noticeable improvement in speed or script success.
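
One way to tell whether the idle processes are true zombies or are merely blocked on storage is to look at the STAT column reported by ps. The following is a minimal sketch, assuming a Linux procps-style ps; the helper names describe_state and list_states are hypothetical, and the match pattern is a guess, since spawned worker processes may show up as python rather than pod5.

```shell
#!/bin/sh
# Hypothetical helper: translate the first letter of a ps STAT code
# into a readable description (standard Linux process states).
describe_state() {
    case "$1" in
        R*) echo "running" ;;
        S*) echo "sleeping" ;;
        D*) echo "uninterruptible sleep (usually I/O)" ;;
        Z*) echo "zombie" ;;
        T*) echo "stopped" ;;
        *)  echo "other" ;;
    esac
}

# Hypothetical helper: list the current user's processes whose command line
# contains the given substring, together with state and CPU usage.
list_states() {
    pattern=$1
    ps -u "$(id -un)" -o pid=,stat=,pcpu=,args= |
    while read -r pid stat pcpu args; do
        case "$args" in
            *"$pattern"*)
                printf '%s\t%s\t%s%%\t%s\n' \
                    "$pid" "$(describe_state "$stat")" "$pcpu" "$args" ;;
        esac
    done
}

# Example (run on the compute node while the job is active):
#   list_states python
```

Many processes in state D with 0% CPU would point towards waiting on the file system rather than towards zombie processes.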

Questions:

  • What is recommended when working with large sequencing data for tools such as pod5 in a Dorado Duplex workflow?

  • How can I reduce the preprocessing time?

  • Is there a sweet spot where preprocessing takes more time than running Dorado Duplex on the unsplit pod5 directories?

  • How can I avoid path errors and invalid read errors? What could be causing them?

Code

The bash pipeline commands are as follows:

# Generating pod5 view summary file for subsetting ...
pod5 view "${BASE_PATH}/${SAMPLE}/${MIDFOLDER}/pod5/" \
    --include "read_id, channel" \
    --output "${BASE_PATH}/${SAMPLE}/${SAMPLE}_pod5View_summary.tsv"

# Subsetting the pod5 dir with the view summary
pod5 subset "${BASE_PATH}/${SAMPLE}/${MIDFOLDER}/pod5/" \
    --summary "${BASE_PATH}/${SAMPLE}/${SAMPLE}_pod5View_summary.tsv" \
    --columns channel \
    --recursive \
    --threads 1 \
    --force-overwrite \
    --output "${DEST_PATH}/${SAMPLE}"

# Running dorado duplex
/.../dorado/dorado-1.1.1-linux-x64/bin/dorado duplex sup "${TMP_DIR}/" \
    --min-qscore 8 \
    --verbose \
    --device 'cuda:all' \
    --threads 32 \
    --models-directory "/path_to_/dorado/dorado_MODELS" \
    > "${TMP_DIR}/${POD5_NAME}_basecalled_${SLURM_ARRAY_TASK_ID}.duplex.bam"
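
Since the "Invalid read id passed" error is raised while the read IDs from the summary are being packed, a sanity check of the summary file before subsetting might help narrow things down. The following is a minimal sketch, not part of pod5: it assumes the summary is tab-separated, has one header line, and contains exactly the two columns read_id and channel in that order; the function name count_bad_rows is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count data rows that are NOT of the form
# "<UUID><TAB><integer>". Assumes one header line and exactly two
# tab-separated columns (read_id, channel) in that order.
count_bad_rows() {
    tab=$(printf '\t')
    # Skip the header, then count the lines that do not match the pattern.
    # grep exits non-zero when the count is 0, hence the "|| true".
    tail -n +2 "$1" |
        grep -Evc "^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}${tab}[0-9]+\$" || true
}

# Example:
#   count_bad_rows "${BASE_PATH}/${SAMPLE}/${SAMPLE}_pod5View_summary.tsv"
```

A non-zero count would suggest that the summary itself was written incompletely or corrupted, for example if pod5 view was interrupted, rather than that the problem lies in pod5 subset.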

System Specs

  • Pod5 version: 0.3.28
  • 4-64 CPUs
  • 200 GB RAM
  • OS: Rocky Linux on an HPC cluster

Error codes

Generating pod5 view summary file for subsetting ...
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/14CGUP-2
/path/to/storage/Promethion_mnt_storage/14CGUP-2/20250717_1348_3G_PAO33929_74d22361
pod5 subset /path/to/storage/Promethion_mnt_storage/14CGUP-2/20250717_1348_3G_PAO33929_74d22361/pod5/ --summary /path/to/storage/Promethion_mnt_storage/14CGUP-2/20250717_1348_3G_PAO33929_74d22361/14CGUP-2_pod5View_summary.tsv --columns channel --output /path/to/storage/Project_pod5_samples_sorted/14CGUP-2
Parsed 151984134 targets
Calculated 151984134 transfers

channel-1094.pod5:  83%|########3 | 29698/35681 [01:10<00:14, 418.73Reads/s]
channel-1094.pod5:  88%|########7 | 31398/35681 [01:15<00:10, 418.11Reads/s]
channel-1094.pod5:  93%|#########3| 33199/35681 [01:19<00:05, 417.99Reads/s]
channel-1094.pod5: 100%|##########| 35681/35681 [01:25<00:00, 418.93Reads/s]
Subsetting:   4%|4         | 115/2670 [4:31:40<100:35:58, 141.75s/Files]
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/home/user/.conda/envs/bioinf/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/user/.conda/envs/bioinf/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 513, in process_subset_tasks
    task = queue.work.get(timeout=60)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/multiprocessing/queues.py", line 114, in get
    raise Empty
_queue.Empty


Generating pod5 view summary file for subsetting ...
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/FR00002AD4
/path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9
pod5 subset /path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9/pod5/ --summary /path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9/FR00002AD4_pod5View_summary.tsv --columns channel --output /path/to/storage/Project_pod5_samples_sorted/FR00002AD4
Parsed 171430773 targets
Calculated 171430773 transfers
    self._target(*self._args, **self._kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
    subset_reads(target, sources, process, duplicate_ok)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 562, in subset_reads
    with p5.Reader(Path(source)) as reader:
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 700, in __init__
    ) = self._open_arrow_table_handles(self._path)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 734, in _open_arrow_table_handles
    raise FileNotFoundError(f"Failed to open pod5 file at: {path}")
FileNotFoundError: Failed to open pod5 file at: /path/to/storage/Promethion_mnt_storage/FR00002AD4/20250728_1359_1F_PAO33952_af436ae9/@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/FR000345AD
/path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49
pod5 subset /path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49/pod5/ --summary /path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49/FR000345AD_pod5View_summary.tsv --columns channel --output /path/to/storage/Project_pod5_samples_sorted/FR000345AD
Parsed 132073268 targets
Calculated 132073268 transfers
    self._target(*self._args, **self._kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
    subset_reads(target, sources, process, duplicate_ok)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 562, in subset_reads
    with p5.Reader(Path(source)) as reader:
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 700, in __init__
    ) = self._open_arrow_table_handles(self._path)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 734, in _open_arrow_table_handles
    raise FileNotFoundError(f"Failed to open pod5 file at: {path}")
FileNotFoundError: Failed to open pod5 file at: /path/to/storage/Promethion_mnt_storage/FR000345AD/20250728_1359_1G_PAO31783_6ffd9d49/
Starting subsetting the pod5 files based on channel ...
Outputpath: /path/to/storage/Project_pod5_samples_sorted/L123
Outputpath temp: /tmp/pod5_subsetting/L123/
pod5 subset /tmp/pod5_subsetting/pod5 --summary /path/to/storage/Promethion_mnt_storage/L123/20250709_1010_3D_PAO33554_61511428/L123_pod5View_summary.tsv --columns channel --recursive --threads 1 --force-overwrite --output /tmp/pod5_subsetting/L123/
Parsed 137626741 targets
Calculated 137626741 transfers
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
    subset_reads(target, sources, process, duplicate_ok)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 563, in subset_reads
    repacker.add_selected_reads_to_output(output, reader, read_ids)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/repack.py", line 89, in add_selected_reads_to_output
    successful_finds, per_batch_counts, all_batch_rows = reader._plan_traversal(
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 1091, in _plan_traversal
    read_ids = pack_read_ids(read_ids, invalid_ok=missing_ok)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/api_utils.py", line 38, in pack_read_ids
    raise RuntimeError("Invalid read id passed")
RuntimeError: Invalid read id passed

Outputpath: /path/to/storage/Project_pod5_samples_sorted/L126
Outputpath temp: /tmp/pod5_subsetting/L126/
pod5 subset /tmp/pod5_subsetting/pod5 --summary /path/to/storage/Promethion_mnt_storage/L126/20250715_1139_2H_PAO33830_8969d520/L126_pod5View_summary.tsv --columns channel --recursive --threads 1 --force-overwrite --output /tmp/pod5_subsetting/L126/
Parsed 126968283 targets
Calculated 126968283 transfers
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 520, in process_subset_tasks
    subset_reads(target, sources, process, duplicate_ok)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 63, in wrapper
    raise exc
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/utils.py", line 60, in wrapper
    ret = func(*args, **kwargs)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/tools/pod5_subset.py", line 563, in subset_reads
    repacker.add_selected_reads_to_output(output, reader, read_ids)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/repack.py", line 89, in add_selected_reads_to_output
    successful_finds, per_batch_counts, all_batch_rows = reader._plan_traversal(
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/reader.py", line 1091, in _plan_traversal
    read_ids = pack_read_ids(read_ids, invalid_ok=missing_ok)
  File "/home/user/.conda/envs/bioinf/lib/python3.9/site-packages/pod5/api_utils.py", line 38, in pack_read_ids
    raise RuntimeError("Invalid read id passed")
RuntimeError: Invalid read id passed

leonard-creator avatar Oct 27 '25 12:10 leonard-creator