parquet-java
optimize the storage for repetition level and definition level
This update optimizes the storage for all-null fields.
In some cases there are thousands of fields, most of which are null within a single data block, and the null values do not all occur at the same nesting level. The repetition and definition levels of these null fields therefore vary (0, 1, 2, etc.) and waste appreciable storage space.
In this update, the writer skips writing the chunk page if all values in the field are null, and the reader is made compatible with the change.
I'm extremely -1 on this change, as it will generate Parquet files that can no longer be read by older implementations. It also discards information that may be relevant in some cases.
As mentioned in the corresponding ticket: please ensure that all your null values are on the same level if you want to skip storing this information.
It definitely makes sense to have an option to ignore this information when it is not relevant for you, but that should come in the form of being able to force all NULLs to occur on the same level. This may be done in Parquet itself or in the object model. This approach would then yield much smaller files that can also be read by existing Parquet versions.
Thanks for the comments, xhochy. I am sorry for overlooking compatibility with older reader implementations, and I am fine with adding an option that keeps the files compatible with old versions. I have no idea how to force all NULLs to occur on the same level. Could you explain more about that?
@lirui-tx The main issues here are the format incompatibility and the correctness problem (a bare "null" doesn't capture at what level the value is null). If you don't care about the level of the null in your data, you can preprocess the data to make it null at the same level everywhere. Since the definition level is RLE-encoded, you would get the same storage savings.
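To make the preprocessing suggestion concrete, here is a minimal sketch in plain Java. It does not use any Parquet API: records are modelled as nested maps, the field names `a`, `b`, `c` and every helper name (`group`, `definitionLevel`, `collapseEmptyGroups`, `normalize`, `countRuns`) are hypothetical, and it assumes every field is optional and nothing is repeated. The run count is only a rough proxy for how well the levels would compress; Parquet's actual level encoding is a hybrid of RLE and bit-packing.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class NullLevelNormalizer {

    // Builds a one-field group; HashMap is used because the value may be null.
    static Map<String, Object> group(String name, Object value) {
        Map<String, Object> g = new HashMap<>();
        g.put(name, value);
        return g;
    }

    // Definition level of the leaf at `path`, assuming every field on the
    // path is optional: the number of non-null fields along the path.
    static int definitionLevel(Map<String, Object> record, String... path) {
        Object current = record;
        int level = 0;
        for (String name : path) {
            if (!(current instanceof Map)) break;
            current = ((Map<?, ?>) current).get(name);
            if (current == null) break;
            level++;
        }
        return level;
    }

    // Returns null when the group contains no non-null leaf; otherwise a copy
    // in which every empty sub-group has been replaced by null.
    static Map<String, Object> collapseEmptyGroups(Map<String, Object> g) {
        if (g == null) return null;
        Map<String, Object> out = new LinkedHashMap<>();
        boolean hasValue = false;
        for (Map.Entry<String, Object> e : g.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) v;
                v = collapseEmptyGroups(child);
            }
            out.put(e.getKey(), v);
            if (v != null) hasValue = true;
        }
        return hasValue ? out : null;
    }

    // Preprocessing step: afterwards every null sits at the outermost level.
    static Map<String, Object> normalize(Map<String, Object> record) {
        Map<String, Object> collapsed = collapseEmptyGroups(record);
        return collapsed == null ? new LinkedHashMap<>() : collapsed;
    }

    // Number of runs of equal consecutive values (proxy for RLE size).
    static int countRuns(int[] levels) {
        int runs = 0;
        for (int i = 0; i < levels.length; i++) {
            if (i == 0 || levels[i] != levels[i - 1]) runs++;
        }
        return runs;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = group("a", null);                          // null at a
        Map<String, Object> r2 = group("a", group("b", null));              // null at b
        Map<String, Object> r3 = group("a", group("b", group("c", null)));  // null at c

        int[] before = {
            definitionLevel(r1, "a", "b", "c"),
            definitionLevel(r2, "a", "b", "c"),
            definitionLevel(r3, "a", "b", "c")
        };
        int[] after = {
            definitionLevel(normalize(r1), "a", "b", "c"),
            definitionLevel(normalize(r2), "a", "b", "c"),
            definitionLevel(normalize(r3), "a", "b", "c")
        };
        System.out.println("runs before: " + countRuns(before)); // levels 0,1,2
        System.out.println("runs after:  " + countRuns(after));  // levels 0,0,0
    }
}
```

With the nulls collapsed to the outermost level, an all-null column produces a single run of definition level 0, and the file stays readable by existing Parquet versions. The price is the one named above: once normalized, the data no longer records at which level each value was null.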