Feature request: Support compressed format e.g. ".posez" or ".pose.zst"
Inspired by `.npy` and `.npz`, I think it would be nice to support a `.posez` format. In my testing so far, using zstd adds almost no time overhead to decompress, and reduces hard drive space requirements by a good chunk, up to 50% in some cases.
I can load the files with:

```python
from pathlib import Path

from pose_format import Pose
from pyzstd import decompress

# file_path is a Path to a .pose.zst file
file_path = Path("example.pose.zst")
pose = Pose.read(decompress(file_path.read_bytes()))
```
This causes two issues I have noticed:
- Code relying on the `.pose` extension is complicated by the double extension (`.pose.zst`). For example, `Path.stem` doesn't work right: given `foo.pose.zst` it returns `foo.pose` instead of the desired `foo`.
- If you try reading in BOTH a `.pose` and a `.pose.zst` in the same session, you get `RuntimeError: PoseHeaderCache hash does not match buffer hash`.
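The `Path.stem` issue is easy to reproduce, and can be worked around with a small helper (`full_stem` is a hypothetical name, not part of `pose-format` or `pathlib`):

```python
from pathlib import Path

path = Path("foo.pose.zst")
print(path.stem)  # "foo.pose" - the double extension leaves ".pose" behind


def full_stem(path: Path) -> str:
    """Strip every suffix, so 'foo.pose.zst' -> 'foo'."""
    while path.suffix:
        path = path.with_suffix("")
    return path.name


print(full_stem(Path("foo.pose.zst")))  # "foo"
```

This still has to be sprinkled through downstream code, which is part of why a single-suffix `.posez` would be cleaner.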
I have tried some workarounds, but I think native support for a `.posez` format would be cleaner.
Related to #34
Supporting a `.posez` format would be nice, but to do so we should probably benchmark different compression methods for speed and size.
Even more ideally, `.posez` could specify the compression method: for example, it starts with one of `deflate`, `gzip`, `bzip2`, `lzma`, or `lz4`, and then we can support various compression methods based on speed - to be less rigid.
Anyway, if you only care about size on disk, we also need to support fp16 instead of just fp32.
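To illustrate the fp16 point (a sketch using stdlib `struct`, not the pose-format storage code): a float32 value takes 4 bytes on disk while a float16 takes 2, so storing body data as fp16 would roughly halve file size independent of any compression, at the cost of precision:

```python
import struct

value = 0.123456789
fp32 = struct.pack("<f", value)  # 4 bytes per float32 value
fp16 = struct.pack("<e", value)  # 2 bytes per float16 value
print(len(fp32), len(fp16))      # 4 2

# The trade-off is precision: fp16 keeps roughly 3 significant
# decimal digits, which may or may not matter for pose coordinates.
print(struct.unpack("<e", fp16)[0])
```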
I did some benchmarking a while back on bz2, xz, and some others. I think all the compression ratios were about the same (~1.5x), except for xz (~1.8x), but .zst was way faster than any of the others. Let me go find the results!
I would imagine that fast decompression is more important than fast compression, and would thus call `.gz` the winner here.
The only benchmarks I need to see to be convinced now are:
- write to disk vs. compress-and-write to disk (i.e., how much relative time is added or saved by compression)
- the same for reading (it might even be faster because less disk reading is required, but it won't allow for reading the data out of sync)
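A minimal harness for that write-vs-compress-and-write comparison might look like this (a sketch using stdlib `gzip` as a stand-in; the real benchmark would swap in `pyzstd` and actual `.pose` payloads):

```python
import gzip
import tempfile
import time
from pathlib import Path


def timed_write(data: bytes, path: Path, compress: bool) -> float:
    """Return seconds taken to (optionally compress and) write `data`."""
    start = time.perf_counter()
    payload = gzip.compress(data) if compress else data
    path.write_bytes(payload)
    return time.perf_counter() - start


# Repetitive dummy payload standing in for a .pose file (~1 MB)
data = b"\x00\x01\x02\x03" * 250_000

with tempfile.TemporaryDirectory() as tmp:
    raw_time = timed_write(data, Path(tmp) / "file.pose", compress=False)
    gz_time = timed_write(data, Path(tmp) / "file.pose.gz", compress=True)
    print(f"raw write: {raw_time:.4f}s, compress+write: {gz_time:.4f}s")
```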
Ran another, larger benchmark with 1k files each from
- PopSign ASL
- ASL Citizen
- Sem-Lex
- YouTube-ASL
- YT-SL-25
This time I added a `.pose` format, which just reads in the file and writes it back out again as its "compress" step, and does the same for "decompress".
I also separately collected stats for just "uncompressed", which is only the reading part.
Full stats: file_compression_stats.csv
Summary Stats:
| Format | File Count | Total Input Size (MB) | Total Compressed Size (MB) | Compression Ratio | Total Compress Time (s) | Total Decompress Time (s) | Mean Compress Time (s) | Mean Decompress Time (s) |
|---|---|---|---|---|---|---|---|---|
| .bz2 | 5000 | 162744 | 105196 | 1.55 | 17993 | 7467.7 | 3.5986 | 1.4935 |
| .gz | 5000 | 162744 | 103675 | 1.57 | 5236.1 | 779.74 | 1.0472 | 0.1559 |
| .pose | 5000 | 162744 | 162744 | 1 | 390.18 | 385.55 | 0.078 | 0.0771 |
| .xz | 5000 | 162744 | 88100.3 | 1.85 | 40648.7 | 3540.42 | 8.1297 | 0.7081 |
| .zip | 5000 | 162744 | 103664 | 1.57 | 5148.09 | 691.76 | 1.0296 | 0.1384 |
| .zst | 5000 | 162744 | 103317 | 1.58 | 557.2 | 215.83 | 0.1114 | 0.0432 |
| uncompressed | 5000 | 162744 | 162744 | 1 | 222.55 | 298.38 | 0.0445 | 0.0597 |
Uploaded results for all 5k files to Google Sheets and made a graph:
https://docs.google.com/spreadsheets/d/1GlGAsV1SD4V2hP3n2Zod7zU0PIIFN2zVdnb6XZwflBw/edit?usp=sharing
Looks like:
- `.zst` is the fastest on average for both read-input + compress + write (0.11 s) and read-compressed + decompress + write (0.04 s).
- `.pose` read-input + "compress" + write-output and read-compressed + "decompress" + write-output each take about 0.08 s.
- `Pose.read` alone takes 0.04 s on average; write takes 0.06 s on average.
CODE:
`Pose.read` takes on average 0.04 s:

```python
with open(file, "rb") as f_in:
    pose = Pose.read(f_in)
```
`pose.write` takes on average 0.06 s:

```python
with open(raw_out_path, "wb") as f_out:
    pose.write(f_out)
```
Read binary file + compress + write takes on average 0.08 s for `.pose` and 0.11 s for `.zst`. That's the time for this code block:

```python
import zipfile
from pathlib import Path

import pyzstd
from pose_format import Pose


def compress_file(
    input_file: Path, output_file: Path, ext: str, level: int = 5
) -> None:
    with open(input_file, "rb") as f_in:
        if ext == ".zip":
            with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zip_out:
                zip_out.write(input_file, arcname=input_file.name)
        elif ext == ".zst":
            with open(output_file, "wb") as f_out:
                pyzstd.compress_stream(f_in, f_out, level_or_option=level)
        elif ext == ".pose":
            # no compression, just read and write back out
            pose = Pose.read(f_in)
            with open(output_file, "wb") as f_out:
                pose.write(f_out)
```
Read binary compressed file + decompress + write back out takes on average 0.08 s for `.pose` and 0.04 s for `.zst`. That's this code block:

```python
def decompress_file(compressed_file: Path, output_file: Path, ext: str) -> None:
    if ext == ".zip":
        with zipfile.ZipFile(compressed_file, "r") as zip_in:
            zip_in.extractall(output_file.parent)
    elif ext == ".zst":
        with open(compressed_file, "rb") as f_in, open(output_file, "wb") as f_out:
            pyzstd.decompress_stream(f_in, f_out)
    elif ext == ".pose":
        # no decompression, just read and write back out
        with open(compressed_file, "rb") as f_in, open(output_file, "wb") as f_out:
            pose = Pose.read(f_in)
            pose.write(f_out)
```
Attempting to account for shared function calls (e.g. `open()`), it seems the `.pose.zst` variation would add somewhat less than the full 0.11 s when writing and 0.08 s when reading. Assuming the `open()` call takes roughly 0.001 s, that would mean `Pose.read()` takes about 0.03 s and the `pyzstd` decompress about 0.02 s.
So the average read time would go up from 0.04 s to about 0.06 s per file.
Conclusion
Without actually implementing the `.posez` format, it looks like the overall overhead is not large: maybe 0.02 s on average, or something like that, for decompress + read.
Cool, so it seems like `.posezst` is going to be the winner?
I again would vote for:
> even more ideally, `.posez` can specify the compression method, so for example, it starts with one of `deflate`, `gzip`, `bzip2`, `lzma`, `lz4` and then we can support various compressions based on speed - to be less rigid
So: in the write method, implement a leading enum first (a single number to indicate the specific method, so more can be added in the future), then pipe the writing through a compression, or not.
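A sketch of that leading-enum idea (the `Compression` enum, the one-byte header, and the helper names are my assumptions, not an agreed format; stdlib codecs stand in here, and zstd would get its own enum value):

```python
import bz2
import gzip
import lzma
from enum import IntEnum


class Compression(IntEnum):
    NONE = 0
    GZIP = 1
    BZIP2 = 2
    LZMA = 3
    # future methods (e.g. ZSTD = 4) can be appended without breaking old readers


_CODECS = {
    Compression.NONE: (lambda b: b, lambda b: b),
    Compression.GZIP: (gzip.compress, gzip.decompress),
    Compression.BZIP2: (bz2.compress, bz2.decompress),
    Compression.LZMA: (lzma.compress, lzma.decompress),
}


def write_posez(payload: bytes, method: Compression) -> bytes:
    """Prefix a single method byte, then the (optionally) compressed payload."""
    compress, _ = _CODECS[method]
    return bytes([method]) + compress(payload)


def read_posez(data: bytes) -> bytes:
    """Dispatch on the leading method byte to pick the decompressor."""
    method = Compression(data[0])
    _, decompress = _CODECS[method]
    return decompress(data[1:])


blob = b"example pose buffer " * 100
assert read_posez(write_posez(blob, Compression.GZIP)) == blob
```

Because the reader dispatches on the byte rather than the file extension, new methods only require a new enum value plus a codec entry, which keeps the format less rigid, as suggested above.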