polars
polars copied to clipboard
Save version in parquet metadata
Description
I think when we save a parquet that the version number should be included in the created_by
metadata
For instance
df.write_parquet("temp.parquet")
pq.ParquetFile("temp.parquet").metadata
<pyarrow._parquet.FileMetaData object at 0x7f22e5d4b1a0>
created_by: Polars ## this should have the version
...
The pyarrow writer stores its version number
pq.write_table(df.to_arrow(), "patemp.parquet")
pq.ParquetFile("patemp.parquet").metadata
<pyarrow._parquet.FileMetaData object at 0x7f22e5dc59e0>
created_by: parquet-cpp-arrow version 16.0.0
...
Some legacy parquet reader even raises on ill-formed created_by
field IIRC. It once even have the regex form somewhat like (.+) version (.*)?\(build .*\)
or similar, though I don't think it is of use nowadays.
Found some jira ticket there - once classified as "major bug" lol: https://issues.apache.org/jira/browse/PARQUET-297
If the official Parquet spec states that this field should include a version, then we should include it. But I can't find an official spec stating this. Do you have a link?
I'm also wondering which version we would include. I'm guessing the Rust Polars version?
I don't know about an official spec mandating a version number, it just seemed like a no downside idea. In the future you might want to know at a glance "this file was written before ___ got fixed or ____ got implemented". Of course as we talk about this now we don't know what those future bugs and optimizations might be.
I would think the rust version that way it doesn't matter what binding is used, all the files with the same writer would produce the same version.
If the official Parquet spec states that this field should include a version, then we should include it. But I can't find an official spec stating this. Do you have a link?
The official parquet file format thrift def mandates something for created_by
field, though the field itself is optional so I doubt it is a strong regulation:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1130-L1134
(even the official arrow-cpp writer does not provide build hash!)
@stinodego I see you removed "needs decision" does that mean displaying the rust version is accepted?
Putting there Rust version and Polars version would be a serious sin?
Neither version makes much sense since the releases are not synchronized. A commit hash would be most useful.
Hello. I've been investigating how to approach this issue, and after some research I believe a good approach would be to use a build script which uses the .cargo_vcs_info.json file to embed the commit hash as an env var for usage with the env!
macro, falling back to an execution of git rev-parse --short HEAD
if the file does not exist (e.g. local development of polars).
There is just one issue with this approach: the .cargo_vcs_info.json
file is not generated when packaging/publishing a crate with the --allow-dirty
flag. I am assuming the crates are published with the Makefile in the crates
folder. The flag is present here: is there any specific reason or could it be removed? Thanks in advance.