
Feature request / question: Convert VBZ-compressed pod5 to ZSTD-compressed pod5 not supported in 0.3.34

Open • Taylorain opened this issue 1 month ago • 6 comments

Description

I am trying to convert an existing VBZ-compressed .pod5 file to a ZSTD-compressed .pod5 file with the pod5 convert CLI.

For example, I tried:

pod5 convert to_pod5 --compression zstd input.pod5 output.zstd.pod5

or

pod5 convert --compression zstd input.pod5 output.zstd.pod5

But I always get the following error:

usage: pod5 convert [-h] {fast5,from_fast5,to_fast5} ...
pod5 convert: error: argument {fast5,from_fast5,to_fast5}: invalid choice: 'to_pod5' (choose from 'fast5', 'from_fast5', 'to_fast5')

Environment

  • pod5 version: 0.3.34
  • Python version: 3.12
  • OS: Linux
  • Source: Installed via pip from Tsinghua mirror
  • Input file: VBZ-compressed pod5
  • Goal: ZSTD-compressed pod5 for downstream Dorado basecalling

Question

  1. Is there currently any CLI way to convert an existing VBZ-compressed .pod5 to ZSTD-compressed .pod5?
  2. If not, is this planned for a future release?
  3. Would the recommended approach be using the Python API with Reader/Writer to re-compress?

Additional context

  • I have the original .pod5 file, but not the original .fast5 files anymore.
  • I want to avoid re-basecalling from scratch if possible.

Thank you for your guidance!

Taylorain, Oct 27 '25 11:10

Hello @Taylorain ,

Pod5 does not support, and has never supported, VBZ compression directly. It has always used a derived VBZ-like compression scheme that evolved slightly from the original fast5 VBZ. Both VBZ (in fast5) and the pod5 VBZ-like compression use zstd internally as part of the compression.

Can I ask what makes you think you have a VBZ-compressed pod5, and where you have seen zstd compression as an option for pod5?

What were you originally trying to achieve?

Thanks,

  • George

0x55555555, Oct 27 '25 12:10

Hi George, I encountered issues while using Dorado for basecalling. Two different POD5 files produced zstd-related decompression errors, and eventually a segmentation fault occurred. Below are the key details from the logs:

  1. File: 3.basecall_5mC_XY/09/pod5/20250522-SSL0495-PBE62452-P05.pass.pod5
     Debug log: the file started loading at [2025-10-26 23:50:05.932].
     Error details: from read 29 onwards, decompression errors occurred.
       • Read 29: "Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected)"
       • Reads 30–43: consistently "Input data not compressed by zstd: (18446744073709551614 Unspecified error code)"

  2. File: pod5/pod5_pass/20250220-SSL0373-PAW85634-P05.pass.pod5
     Debug log: the file started loading at [2025-10-29 13:47:23.671].
     Error details:
       • Read 383: "Input data failed to decompress using zstd: (18446744073709551596 Data corruption detected)"
       • Read 384: "Input data not compressed by zstd: (18446744073709551614 Unspecified error code)"
     Critical follow-up: after the above errors, a segmentation fault occurred: "/bio/0.consummate/02.dodo/: line 69: 187881 Segmentation fault"

Could you please help investigate the root cause of these zstd decompression failures and the subsequent segmentation fault? Let me know if you need additional logs or system configuration details.

Thanks, Taylorain

Taylorain, Oct 30 '25 03:10
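
The large numbers in the log messages above are easier to read once decoded. zstd reports a failure by returning the negated error code cast to an unsigned `size_t`, so on a 64-bit build the code can be recovered by subtracting the value from 2^64. The following is a minimal sketch under that assumption; the helper name `decode_zstd_return` and the small lookup table are illustrative only, and the table is a partial copy of the codes listed in zstd's `zstd_errors.h`, not a complete list.

```python
SIZE_T_MODULUS = 2 ** 64  # assumption: the logs come from a 64-bit build

# Partial table of zstd error codes (from zstd_errors.h); not exhaustive.
KNOWN_ZSTD_ERRORS = {
    1: "GENERIC",
    10: "prefix_unknown",
    20: "corruption_detected",
}


def decode_zstd_return(value: int) -> int:
    """Recover the positive error code from a size_t-encoded zstd return value."""
    return SIZE_T_MODULUS - value


# The two values quoted in the Dorado log messages.
for raw in (18446744073709551596, 18446744073709551614):
    code = decode_zstd_return(raw)
    name = KNOWN_ZSTD_ERRORS.get(code, "not in this table")
    print(f"{raw} -> code {code} ({name})")
```

The first value decodes to 20, which matches the "Data corruption detected" text in the log. The second decodes to 2, which is not in the table above and is consistent with the "Unspecified error code" text.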

Hi @Taylorain,

How have the pod5 files been managed since writing, are you sure they have not become corrupted during copy between machines?

The messages above indicate to me the files have become corrupt.

Thanks,

  • George

0x55555555, Oct 30 '25 08:10

I’m not entirely sure — could you please advise on how to check the integrity of the pod5 files? I’d like to verify whether they have indeed become corrupted during transfer.

Taylorain, Oct 30 '25 13:10
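
One rough way to see which reads fail to decode, independent of Dorado, is to walk the file with the pod5 Python package and touch every signal. This is only a sketch, not a real integrity check: it assumes the `pod5` package is installed and that `pod5.Reader`, `reader.reads()`, `read.read_id` and `read.signal` behave as described in the pod5 documentation. The function name `scan_reads` and the `open_reader` parameter are hypothetical, added here so the loop can be exercised without a real file. A crash inside native code (such as the segmentation fault reported above) cannot be caught this way.

```python
def scan_reads(path, open_reader=None):
    """Try to decompress every read's signal; return (reads_checked, failures).

    Each failure is a tuple of (read index, read id, error message).
    """
    if open_reader is None:
        # Imported lazily so the function can be tested without pod5 installed.
        import pod5
        open_reader = pod5.Reader

    failures = []
    checked = 0
    with open_reader(path) as reader:
        for index, read in enumerate(reader.reads()):
            try:
                _ = read.signal  # accessing the signal forces decompression
            except Exception as exc:  # the exact exception type is not assumed
                failures.append((index, str(read.read_id), str(exc)))
            checked += 1
    return checked, failures
```

A call such as `scan_reads("example.pod5")` (a placeholder path) would return the number of reads visited and a list of the reads whose signal could not be decoded. A clean result shows only that every signal decompressed, not that the file is byte-identical to the original.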

Hi @Taylorain,

MinKNOW and pod5 files do not currently have integrity checking built in. You would need to run some kind of checksum (e.g. md5sum) on your files at the source where they were generated, then run it again on the file you are basecalling to check that the two are identical.

We are actively working on bringing integrity checking to the outputs of MinKNOW (hashing for some output types was released with MinKNOW 25.09 yesterday), but for pod5 this is something you have to do on your own right now.

Thanks,

  • George

0x55555555, Oct 30 '25 13:10
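
The checksum comparison suggested above can also be scripted where `md5sum` is not available. The sketch below uses only Python's standard `hashlib`; the function names `file_digest` and `files_match` are illustrative, and MD5 is the default only so that the digest can be compared directly with `md5sum` output.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large pod5 files are not loaded into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(source_path, copy_path, algorithm="md5"):
    """Return True when both files produce the same digest."""
    return file_digest(source_path, algorithm) == file_digest(copy_path, algorithm)
```

In practice the digest would be computed once on the machine where the file was written and once on the machine running the basecaller, and the two hex strings compared. `files_match` is only useful when both copies are reachable from the same machine.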

Thank you very much!

Taylorain, Oct 30 '25 13:10