PARQUET-2474: Add FIXED_SIZE_LIST logical type
As proposed in https://github.com/apache/arrow/issues/34510, on the ML, and in PARQUET-2474.
Arrow recently introduced FixedShapeTensor and VariableShapeTensor canonical extension types that use FixedSizeList and StructArray(List, FixedSizeList) as storage, respectively. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on-disk storage.
However, FixedSizeList is currently stored as a List in Parquet, which adds significant conversion overhead when reading and writing, as discussed here. It would therefore be beneficial to introduce a FIXED_SIZE_LIST logical type to Parquet.
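For illustration, a minimal pyarrow sketch (assuming current pyarrow behaviour; the file name is arbitrary) of how a FixedSizeList column is written with the generic LIST encoding, so the fixed length is not visible in the Parquet schema itself:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A FixedSizeList<int32>[3] column: every row holds exactly three values.
vecs = pa.array([[1, 2, 3], [4, 5, 6]], type=pa.fixed_size_list(pa.int32(), 3))
pq.write_table(pa.table({"vec": vecs}), "vectors.parquet")

# The Parquet schema only contains a repeated (LIST-annotated) group;
# the fixed length 3 is not expressible in the Parquet schema today.
print(pq.ParquetFile("vectors.parquet").schema)
```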
cc @wgtmac @tustvold @alippai @mapleFU @AlenkaF
One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode an m by n matrix, would you just encode this as an m * n list or do we want to support this as a first-class concept?
I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.
cc @JFinis
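To make the level bookkeeping concrete, a small pyarrow sketch (again only an illustration of current behaviour) of the non-zero max definition/repetition levels a plain List column carries today, which such a "REPEATED" variant would aim to avoid incrementing:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A plain variable-length list column written the usual way.
xs = pa.array([[1, 2, 3], [4, 5]], type=pa.list_(pa.int32()))
pq.write_table(pa.table({"xs": xs}), "list.parquet")

# The leaf column of a List carries non-zero max definition and repetition
# levels; a fixed-size "REPEATED" variant could leave both untouched.
leaf = pq.ParquetFile("list.parquet").schema.column(0)
print(leaf.max_definition_level, leaf.max_repetition_level)  # e.g. 3 1
```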
Apologies for taking a while to reply.
I've split this into two cases: FixedSizeListType (length is constant) and VariableSizeListType (length differs per row) for the sake of discussion. I would move VariableSizeListType into a separate PR if we even decide it is needed next to ListType.
> One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode an m by n matrix, would you just encode this as an m * n list or do we want to support this as a first-class concept?
We could start with a more general multidimensional array definition and have list be a one-dimensional array. The additional metadata required would not be that bad. I'm just a bit scared of validation and striding logic bleeding into parquet implementations. Do we have any other inputs / opinions?
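For reference on the matrix question, a sketch of the "flattened" approach using pyarrow's FixedShapeTensor extension type mentioned above (assuming a pyarrow version that ships it): each 2x3 matrix is stored as a FixedSizeList of length 6 and the shape lives only in the type metadata.

```python
import pyarrow as pa

# FixedShapeTensor with shape (2, 3): each row is a 2x3 matrix stored
# flattened as a FixedSizeList<int32>[6]; the shape lives in type metadata.
tensor_type = pa.fixed_shape_tensor(pa.int32(), [2, 3])
storage = pa.array([list(range(6)), list(range(6, 12))],
                   type=pa.fixed_size_list(pa.int32(), 6))
matrices = pa.ExtensionArray.from_storage(tensor_type, storage)
print(matrices.type)          # the tensor extension type, shape (2, 3)
print(matrices.storage.type)  # fixed_size_list<item: int32>[6]
```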
> I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.
That's interesting. What would you expect, performance-wise, with this approach?
Thanks for the review @etseidl ! I've updated this with your suggestions.
@ritchie46 would this be useful for your new polars Array type?
@rok is there anything I can help with?
@mapleFU I saw your questions above. Are you satisfied with the answers?
@coastalwhite I see you are familiar with Parquet and Array in Polars. Do you think this proposal is useful for your project?
I like the general idea of moving FixedSizeList partially away from List and towards FixedSizeBinary, but I doubt it would lead to serious speedups or simplification possibilities.
The List-based deserializer already batches decoding most of the time, similarly to what this would allow, although it would allow skipping many checks that happen before the actual deserialization takes place. We would also still need to support the old path for a long time, since a lot of people write parquet files using old versions of the parquet specification and generally use old parquet files.
The one potentially large upside I can imagine of this is getting dictionary encoding for arrays, but I am not sure how common that will be in real-world scenarios.
In general, I would say I am in favor. Although, I am not 100% convinced yet that the added complexity will result in significant performance, file size or other benefits.
@coastalwhite there is a 10x penalty in Polars 1.9.0 parquet reading as well using this snippet: https://github.com/apache/arrow/issues/34510#issuecomment-1462753928
> @rok is there anything I can help with?
@alippai thanks for pinging. I was advised on the parquet sync call to re-open an ML discussion on this, but I need a couple of weeks to get to it. If you'd like you can start it now; here's the existing thread: https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3
I suppose it'd be useful to report on the pros and cons discussed here and propose we move forward.
> @coastalwhite there is a 10x penalty in Polars 1.9.0 parquet reading as well using this snippet: apache/arrow#34510 (comment)
Thank you for bringing that to my attention. Still, I feel like that is more of a bug than an inherent performance problem in the Parquet file format. However, it is probably easier to optimize for what is proposed in this PR.
@rok based on the ML discussion we should add the fast path in the cases of polars, arrow and arrow-rs where we know the fixed size already (from schema stored in the metadata or if it's provided by the consumer). This is more fragile and less universal, but maybe a good first step forward
> @rok based on the ML discussion we should add the fast path in the cases of polars, arrow and arrow-rs where we know the fixed size already (from schema stored in the metadata or if it's provided by the consumer). This is more fragile and less universal, but maybe a good first step forward
@alippai are you sure we have a strong enough consensus yet to start implementing fast paths? I would really like to have some more discussion before committing.
@rok Sorry, wrong phrasing. I meant that exploring this was the recommendation from the ML discussion and from @coastalwhite.
I didn’t see objections to adding this feature to the parquet format, or commitments to adding the fast path to any of the libraries (arrow cpp actually noted it’s a non-trivial part of the codebase)
Sorry for my abundance of caution @alippai. I'll try to summarize this thread to the ML and ask for some more input ASAP. It would be nice to actually start some work on this.
Some points in no particular order:
- The parquet schema is authoritative, with any other schema information merely a hint; this makes the notion of using the arrow schema, or something else, to drive decode a little dubious
- The record shredding logic for lists is the single most complex, confusing and subtle aspect of any parquet reader, which:
  - Limits the pool of people who can implement / review such changes
  - Sets a very high bar for including such changes
- Even some optimal record shredding setup will never perform better than an implementation that can simply skip it entirely
- Both arrow-rs and polars exploit that the hybrid RLE is effectively a bitmask if the max definition level is only 1; this allows for very efficient decode (see the sketch below). This isn't possible when there are repetition levels
- Performant record skipping, e.g. for predicate/index pushdown or late materialization, is not really possible against data with repetition levels [1]
- Many readers have quirky support for repetition levels and lists in general, especially w.r.t. areas where the specification has been ambiguous in the past (and some where it still is); finding ways for people to avoid these pain points seems valuable
That's all to say providing a way to encode fixed size lists seems like a very useful capability. That being said, it does seem to be a bit of a hack to make this a logical type, and will potentially limit the options for encodings, statistics, sort orders, etc... In particular the lack of dictionary encoding I could see being a non-trivial sacrifice.
[1] In fact I think arrow-rs may be one of the few readers that actually implements it
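A toy illustration of the definition-level bitmask point above (not how arrow-rs or polars actually implement it, just the idea):

```python
# Toy sketch: with max_def_level == 1 and no repetition levels, the
# definition levels are exactly a validity bitmask, so non-null values can
# be placed into the output without any record-shredding logic.
def decode_flat_optional(def_levels, non_null_values):
    values = iter(non_null_values)
    return [next(values) if d == 1 else None for d in def_levels]

print(decode_flat_optional([1, 0, 1, 1], [10, 20, 30]))  # [10, None, 20, 30]
```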