Feature request: Support compressed format e.g. ".posez" or ".pose.zst"
Inspired by `.npy` and `.npz`, I think it would be nice to support a `.posez` format. In my testing so far, using zstd adds almost no time overhead to decompress, and reduces hard drive space requirements by a good chunk, up to 50% in some cases.
I can load the files with:

```python
from pathlib import Path

from pose_format import Pose
from pyzstd import decompress

# file_path is a Path to a .pose.zst file
file_path = Path("example.pose.zst")
pose = Pose.read(decompress(file_path.read_bytes()))
```
This causes two issues I have noticed:
- Code relying on the `.pose` extension is complicated by the double extension (`.pose.zst`). For example, `Path.stem` doesn't work right: given `foo.pose.zst` it returns `foo.pose` instead of the desired `foo`.
- If you try reading in BOTH a `.pose` and a `.pose.zst` in the same session, you get `RuntimeError: PoseHeaderCache hash does not match buffer hash`.
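The `Path.stem` issue is easy to reproduce, and can be worked around with a small helper (`full_stem` is a hypothetical name, not part of `pose-format` or `pathlib`):

```python
from pathlib import Path

path = Path("foo.pose.zst")
print(path.stem)  # "foo.pose" - the double extension leaves ".pose" behind


def full_stem(path: Path) -> str:
    """Strip every suffix, so 'foo.pose.zst' -> 'foo'."""
    while path.suffix:
        path = path.with_suffix("")
    return path.name


print(full_stem(Path("foo.pose.zst")))  # "foo"
```

This still has to be sprinkled through downstream code, which is part of why a single-suffix `.posez` would be cleaner.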
I have tried some workarounds, but I think native support for a `.posez` format would be cleaner.
Related to #34
Supporting a `.posez` format would be nice, but to do so we should probably benchmark different compression methods for speed and size.
Even more ideally, `.posez` could specify the compression method: for example, it starts with one of `deflate`, `gzip`, `bzip2`, `lzma`, or `lz4`, and then we can support various compression methods based on speed - to be less rigid.
Anyway, if you only care about size on disk, we also need to support fp16 instead of just fp32.
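To illustrate the fp16 point (a sketch using stdlib `struct`, not the pose-format storage code): a float32 value takes 4 bytes on disk while a float16 takes 2, so storing body data as fp16 would roughly halve file size independent of any compression, at the cost of precision:

```python
import struct

value = 0.123456789
fp32 = struct.pack("<f", value)  # 4 bytes per float32 value
fp16 = struct.pack("<e", value)  # 2 bytes per float16 value
print(len(fp32), len(fp16))      # 4 2

# The trade-off is precision: fp16 keeps roughly 3 significant
# decimal digits, which may or may not matter for pose coordinates.
print(struct.unpack("<e", fp16)[0])
```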
I did some benchmarking a while back on bz2, xz, and some others. I think all the compression ratios were about the same (~1.5x), except for xz (~1.8x), but .zst was way faster than any of the others. Let me go find the results!
I would imagine that fast decompression is more important than fast compression, and would thus call `.gz` the winner here.
The only benchmarks I need to see to be convinced now are:
- write to disk vs. compress-and-write to disk (i.e., how much relative time is added or saved by compression)
- the same for reading (it might even be faster because less disk reading is required, but it won't allow for reading the data out of sync)
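A minimal harness for that write-vs-compress-and-write comparison might look like this (a sketch using stdlib `gzip` as a stand-in; the real benchmark would swap in `pyzstd` and actual `.pose` payloads):

```python
import gzip
import tempfile
import time
from pathlib import Path


def timed_write(data: bytes, path: Path, compress: bool) -> float:
    """Return seconds taken to (optionally compress and) write `data`."""
    start = time.perf_counter()
    payload = gzip.compress(data) if compress else data
    path.write_bytes(payload)
    return time.perf_counter() - start


# Repetitive dummy payload standing in for a .pose file (~1 MB)
data = b"\x00\x01\x02\x03" * 250_000

with tempfile.TemporaryDirectory() as tmp:
    raw_time = timed_write(data, Path(tmp) / "file.pose", compress=False)
    gz_time = timed_write(data, Path(tmp) / "file.pose.gz", compress=True)
    print(f"raw write: {raw_time:.4f}s, compress+write: {gz_time:.4f}s")
```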
Ran another, larger benchmark with 1k files each from
- PopSign ASL
- ASL Citizen
- Sem-Lex
- YouTube-ASL
- YT-SL-25
This time I added a `.pose` format, which just reads in the file and writes it back out again as its "compress" step, and does the same for "decompress".
I also separately collected stats for just "uncompressed", which is only the reading part.
Full stats: file_compression_stats.csv
Summary Stats:
| Format | File Count | Total Input Size (MB) | Total Compressed Size (MB) | Compression Ratio | Total Compress Time (s) | Total Decompress Time (s) | Mean Compress Time (s) | Mean Decompress Time (s) |
|---|---|---|---|---|---|---|---|---|
| .bz2 | 5000 | 162744 | 105196 | 1.55 | 17993 | 7467.7 | 3.5986 | 1.4935 |
| .gz | 5000 | 162744 | 103675 | 1.57 | 5236.1 | 779.74 | 1.0472 | 0.1559 |
| .pose | 5000 | 162744 | 162744 | 1 | 390.18 | 385.55 | 0.078 | 0.0771 |
| .xz | 5000 | 162744 | 88100.3 | 1.85 | 40648.7 | 3540.42 | 8.1297 | 0.7081 |
| .zip | 5000 | 162744 | 103664 | 1.57 | 5148.09 | 691.76 | 1.0296 | 0.1384 |
| .zst | 5000 | 162744 | 103317 | 1.58 | 557.2 | 215.83 | 0.1114 | 0.0432 |
| uncompressed | 5000 | 162744 | 162744 | 1 | 222.55 | 298.38 | 0.0445 | 0.0597 |
Uploaded results for all 5k files to Google Sheets and made a graph:
https://docs.google.com/spreadsheets/d/1GlGAsV1SD4V2hP3n2Zod7zU0PIIFN2zVdnb6XZwflBw/edit?usp=sharing
Looks like:
- `.zst` is the fastest on average for both read-input + compress + write (0.11 s) and read-compressed + decompress + write (0.04 s).
- `.pose` read-input + "compress" + write-output and read-compressed + "decompress" + write-output each take about 0.08 s.
- `Pose.read` alone takes 0.04 s on average; write takes 0.06 s on average.
CODE:
`Pose.read` takes on average 0.04 s:

```python
with open(file, "rb") as f_in:
    pose = Pose.read(f_in)
```
`pose.write` takes on average 0.06 s:

```python
with open(raw_out_path, "wb") as f_out:
    pose.write(f_out)
```
Read binary file + compress + write takes on average 0.08 s for `.pose` and 0.11 s for `.zst`. That's the time for this code block:

```python
import zipfile
from pathlib import Path

import pyzstd
from pose_format import Pose


def compress_file(
    input_file: Path, output_file: Path, ext: str, level: int = 5
) -> None:
    with open(input_file, "rb") as f_in:
        if ext == ".zip":
            with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zip_out:
                zip_out.write(input_file, arcname=input_file.name)
        elif ext == ".zst":
            with open(output_file, "wb") as f_out:
                pyzstd.compress_stream(f_in, f_out, level_or_option=level)
        elif ext == ".pose":
            # no compression, just read and write back out
            pose = Pose.read(f_in)
            with open(output_file, "wb") as f_out:
                pose.write(f_out)
```
Read binary compressed file + decompress + write back out takes on average 0.08 s for `.pose` and 0.04 s for `.zst`. That's this code block:

```python
def decompress_file(compressed_file: Path, output_file: Path, ext: str) -> None:
    if ext == ".zip":
        with zipfile.ZipFile(compressed_file, "r") as zip_in:
            zip_in.extractall(output_file.parent)
    elif ext == ".zst":
        with open(compressed_file, "rb") as f_in, open(output_file, "wb") as f_out:
            pyzstd.decompress_stream(f_in, f_out)
    elif ext == ".pose":
        # no decompression, just read and write back out
        with open(compressed_file, "rb") as f_in, open(output_file, "wb") as f_out:
            pose = Pose.read(f_in)
            pose.write(f_out)
```
Attempting to account for shared function calls (e.g. `open()`), it seems the `.pose.zst` variation would add somewhat less than the full 0.11 s when writing and 0.08 s when reading. Assuming the `open()` call takes roughly 0.001 s, that would mean `Pose.read()` takes about 0.03 s and the `pyzstd` decompress about 0.02 s.
So the average read time would go up from 0.04 s to about 0.06 s per file.
Conclusion
Without actually implementing the `.posez` format, it looks like the overall overhead is not large: maybe 0.02 s on average, or something like that, for decompress + read.
Cool, so it seems like `.posezst` is going to be the winner?
I again would vote for:
> even more ideally, `.posez` can specify the compression method, so for example, it starts with one of `deflate`, `gzip`, `bzip2`, `lzma`, `lz4` and then we can support various compressions based on speed - to be less rigid
So: in the write method, implement a leading enum first (a single number to indicate the specific method, so more can be added in the future), then pipe the writing through a compression, or not.
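A sketch of that leading-enum idea (the `Compression` enum, the one-byte header, and the helper names are my assumptions, not an agreed format; stdlib codecs stand in here, and zstd would get its own enum value):

```python
import bz2
import gzip
import lzma
from enum import IntEnum


class Compression(IntEnum):
    NONE = 0
    GZIP = 1
    BZIP2 = 2
    LZMA = 3
    # future methods (e.g. ZSTD = 4) can be appended without breaking old readers


_CODECS = {
    Compression.NONE: (lambda b: b, lambda b: b),
    Compression.GZIP: (gzip.compress, gzip.decompress),
    Compression.BZIP2: (bz2.compress, bz2.decompress),
    Compression.LZMA: (lzma.compress, lzma.decompress),
}


def write_posez(payload: bytes, method: Compression) -> bytes:
    """Prefix a single method byte, then the (optionally) compressed payload."""
    compress, _ = _CODECS[method]
    return bytes([method]) + compress(payload)


def read_posez(data: bytes) -> bytes:
    """Dispatch on the leading method byte to pick the decompressor."""
    method = Compression(data[0])
    _, decompress = _CODECS[method]
    return decompress(data[1:])


blob = b"example pose buffer " * 100
assert read_posez(write_posez(blob, Compression.GZIP)) == blob
```

Because the reader dispatches on the byte rather than the file extension, new methods only require a new enum value plus a codec entry, which keeps the format less rigid, as suggested above.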