polars icon indicating copy to clipboard operation
polars copied to clipboard

Save version in parquet metadata

Open deanm0000 opened this issue 10 months ago • 8 comments

Description

I think when we save a parquet that the version number should be included in the created_by metadata

For instance

df.write_parquet("temp.parquet")
pq.ParquetFile("temp.parquet").metadata
<pyarrow._parquet.FileMetaData object at 0x7f22e5d4b1a0>
  created_by: Polars ## this should have the version
...

The pyarrow writer stores its version number

pq.write_table(df.to_arrow(), "patemp.parquet")
pq.ParquetFile("patemp.parquet").metadata
<pyarrow._parquet.FileMetaData object at 0x7f22e5dc59e0>
  created_by: parquet-cpp-arrow version 16.0.0
...

deanm0000 avatar Apr 26 '24 13:04 deanm0000

Some legacy parquet reader even raises on ill-formed created_by field IIRC. It once even have the regex form somewhat like (.+) version (.*)?\(build .*\) or similar, though I don't think it is of use nowadays.

cjackal avatar Apr 26 '24 14:04 cjackal

Found some jira ticket there - once classified as "major bug" lol: https://issues.apache.org/jira/browse/PARQUET-297

cjackal avatar Apr 26 '24 14:04 cjackal

If the official Parquet spec states that this field should include a version, then we should include it. But I can't find an official spec stating this. Do you have a link?

I'm also wondering which version we would include. I'm guessing the Rust Polars version?

stinodego avatar Apr 26 '24 17:04 stinodego

I don't know about an official spec mandating a version number, it just seemed like a no downside idea. In the future you might want to know at a glance "this file was written before ___ got fixed or ____ got implemented". Of course as we talk about this now we don't know what those future bugs and optimizations might be.

I would think the rust version that way it doesn't matter what binding is used, all the files with the same writer would produce the same version.

deanm0000 avatar Apr 26 '24 17:04 deanm0000

If the official Parquet spec states that this field should include a version, then we should include it. But I can't find an official spec stating this. Do you have a link?

The official parquet file format thrift def mandates something for created_by field, though the field itself is optional so I doubt it is a strong regulation: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1130-L1134 (even the official arrow-cpp writer does not provide build hash!)

cjackal avatar Apr 27 '24 16:04 cjackal

@stinodego I see you removed "needs decision" does that mean displaying the rust version is accepted?

deanm0000 avatar Apr 30 '24 21:04 deanm0000

Putting there Rust version and Polars version would be a serious sin?

lmocsi avatar May 01 '24 18:05 lmocsi

Neither version makes much sense since the releases are not synchronized. A commit hash would be most useful.

stinodego avatar May 01 '24 20:05 stinodego

Hello. I've been investigating how to approach this issue, and after some research I believe a good approach would be to use a build script which uses the .cargo_vcs_info.json file to embed the commit hash as an env var for usage with the env! macro, falling back to an execution of git rev-parse --short HEAD if the file does not exist (e.g. local development of polars).

There is just one issue with this approach: the .cargo_vcs_info.json file is not generated when packaging/publishing a crate with the --allow-dirty flag. I am assuming the crates are published with the Makefile in the crates folder. The flag is present here: is there any specific reason or could it be removed? Thanks in advance.

Carbonhell avatar May 21 '24 15:05 Carbonhell