parquet-format icon indicating copy to clipboard operation
parquet-format copied to clipboard

PARQUET-2249: Add nan_count to handle NaNs in statistics

Open JFinis opened this issue 2 years ago • 30 comments
trafficstars

This commit proposes an improvement for handling of NaN values in FLOAT and DOUBLE type columns. The goal is to allow reading engines, regardless of how they order NaN w.r.t. other values, to efficiently use statistics for scan pruning on those columns, which currently is impossible in most cases. This is to be accomplished in a fully backward compatible manner, so that existing readers and writers do not need to be updated immediatly but can migrate over time to make use of the improved semantics.

There was already work on improving the handling of float and double columns which laid good ground work for backward compatible improvements, but it wasn't sufficient to fix all the problems with NaN values, which are laid out hereinafter.

Problem Description

Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely the following problems exist:

Statistics don't tell whether NaNs are present

As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might seem to be not too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview over what different systems are doing can be found here. The gist of it is that different systems with different semantics exist w.r.t. NaNs and most of the systems do order NaNs; either less than or greater than all other values.

For example, if the semantics of the reading query engine mandate that NaN is to be treated greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0 now, the engine would not be able to skip the page, as the page might contain NaNs which would need to be included in the query result.

Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be less than all other values by the reading engine. Again, a page with min = 2.0 couldn't be skipped in this case by the reader.

Thus, even if a user doesn't query for NaN explicitly, they might use other predictes that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem on first sight.

Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as state, that can be many seemingly innocuous greater than or less than predicates in most databases systems. Conversely, it would be nice if Parquet would enable scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller, greater, or incomparable to all other values.

Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the way more common case of NaNs not being present.

Handling NaNs in a ColumnIndex

There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so we have to put something in here. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed.

It would be nice if there was a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present.

Handling only-NaN pages & column chunks

Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk, whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs or only NaNs, but no non-NaN non-null values.

The Statistics objects stored in page headers and in the file footer have a similar, albeit smaller problem: min_value and max_value are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: Simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine did decide to omit them for whatever other reason, which is allowed by the spec as the field is optional.

Consequently, a reader cannot know whether missing min_value and max_value means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains".

It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages.

Proposed solution

Adding NaN counts

The proposed solution for handling NaNs in statistics is akin to what Apache Iceberg does: add an optional nan_count field to Statistics and an optional nan_counts list to ColumnIndex. This way, all places where statistics are being retained can specify whether NaN values are present. This already solves the first problem, as now queries wanting to retain NaNs can check whether the count is > 0 to see whether a page or column chunk contains NaNs.

Handling NaN-only pages & column chunks

Adding nan_count/nan_counts fields does not solve the problem of only-NaN pages, yet. But since we have a new optional field in both the Statistics object and the ColumnIndex object, we can tie a stricter semantics to the existence of this field. I.e., we can mandate that writers who write this optional field have to treat NaNs in a specific way.

We basically have two options for treating only-NaN pages or column chunks:

  • to write NaN as min/max in this case.
  • to write nothing, i.e.,
    • omit the min_value / max_value in Statistics
    • write byte[0] in the min_values/max_values list entry of the ColumnIndex

I propose to go with the first option of writing NaN as min/max in case of only-Nan pages & column chunks. A section depicting the decision process for this follows below.

Thus, to solve the problem of only-NaN pages, the comments in the spec are extended to mandate the following behavior:

  • Once a writer writes the nan_count/nan_counts fields, they have to:
    • never write NaN into min/max if there are non-NaN non-Null values and
    • always write min=max=NaN if the only non-null values in a page are NaN
  • A reader observing that nan_count/nan_counts field was written can then rely on that if min or max are NaN, then both have to be NaN and this means that the only non-NULL values are NaN.

Should we write NaN or nothing for only-NaN pages & column chunks?

Here are the cons of each approach and how to mitigate them:

CONs for writing NaN in this case:

  • Writing NaN breaks with the "don't write NaN into min and max bounds" rule.
    • However, one could argue that breaking the rule in this edge case is okay, as since if NaN is the only value in a page, then it doesn't matter where to sort NaN w.r.t. other values, as there are no other values. It is the only value in the page, so it is the min and max of the page
    • Breaking this rule has no consequences on existing readers, as they should ignore NaN anyway, i.e., treat it as if it wasn't written, so legacy readers should treat both cases the same anyway.
  • There might be existing writers that have written NaN for min & max for pages that do not only contain NaN but also other values, so a reader couldn't rely on min=max=NaN to mean that the only non-null value is NaN
    • However, as specified, we enforce a stricter semantics once the nan_count field is written: We define that once a writing engine writes this field, it has to never write NaN into min/max if there are non-NaN non-Null values and always write min=max=NaN if the only non-null values in a page are NaN. Consequently, readers can rely on the semantics once they observe that the nan_count field is written and this becomes a non-issue.
  • NaNs take more space than not writing the field or writing byte[0] in the column index. This space overhead should usually be negligible.

In conclusion, there is no big con for writing NaN. It can be implemented in a fully backward compatible way that still allows future writers and readers to apply a more strict semantics.

CONs for writing nothing in this case:

  • Writing byte[0] to a ColumnIndex might break older readers who expect the min_values/max_values field to be a value with correct length unless null_pages is true for the entry.
    • Although readers should be lenient enough and handle wrongly sized min/max values gracefully by ignoring them we cannot be sure this is the case for any reader. Thus, certain legacy spec-conforming readers will reject the new Parquet file, which is bad.
  • Omitting the min_value / max_value fields in Statistics is suboptimal, as it is indistinguishable from the writing engine not writing these fields for whatever other reason. Would we do this, then the page could never be pruned by a reader, as the reader couldn't rely on this meaning "there are only NaNs" vs "the writer hasn't written any min/max for this page".
    • We could define that a writer may not omit min/max if they write the nan_count field and must only omit them if a page has only NaNs, but this seems to be quite fishy, semancially.

In conclusion, the cons for the NaN approach have mitigations and can be handled in a backward compatible way, while the cons for writing nothing might be non-backward-compatible. Therefore, I propose to write NaN as min/max for only-nan pages & column chunks.

Considerations

Backward compatibility

The suggested change is fully backward compatible both in the read and write direction:

Older readers not supporting nan_count/nan_counts yet can stay as is. As the fields are optional, readers not supporting them will simply ignore them.

The spec already today mandates that if a reader sees NaN in min or max fields they should ignore it. They can continue doing so.

No older reader will have regressed performance; any page that an older reader would have skipped before can still be skipped.

Older writers can continue writing files without nan_count/nan_counts. Even if an old reader sets min=max=NaN for a page that does contain non-NaN values, readers supporting this new semantics will not misinterpret these bounds, as the writer will not write nan_count/nan_counts, so the new more strict semantics does not apply when reading.

As nan_count/nan_counts are local to the scopes where they apply (column index, page, column chunk), even stitching together row groups from a writer that didn't write them and a writer that does write them works. This would result in a file where some pages / column indexes / column chunks would have them written while others wouldn't.

Versioning

This proposal definitly does not require a major version bump, as it is fully backward compatible.

I do not fully understand the versioning policy of parquet, so I don't know whether this change would require a minor version bump. One could argue that it is not necessary as the mere existence of the nan_count/nan_counts field would be the "feature flag" that would indicate whether a writer supported this change or not. There wouldn't be a version check necessary in a reader; they would just need to check for the existence of the nan_count/nan_counts fields.

No Unnecessary Overhead

As thrift encodes missing optional fields with zero bytes in the compact protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the new optional fields.

Design alternatives and why they were not chosen

Ordering NaN before or after all other values

We could simply define NaN to be smaller or greater than all other values and then allow NaN in the respective bound.

This however has many drawbacks:

  • NaN is the widest bound possible, so adding NaNs to min and max isn't very useful, as it makes pruning for non-NaN values almost impossible in the respective direction.
  • As mentioned, not all systems agree on whether NaN is larger or smaller than all other values. If we decided for one, systems that choose the other semantics couldn't filter effectively.
  • This contradicts the current spec of not writing NaN to min/max, so it would make older readers no longer skip pages they could skip before.

Adding a new ColumnOrder

We could add a new ColumnOrder that specifies NaN to be smaller or greater than all other values, or even supports -NaN and +NaN ordering them as smaller and larger than all values, respectively. For example, Iceberg mandates the following sort order:

-NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN

Once we define such an order, we could again allow NaN (and potentially -NaN) in bounds again.

This however has the following drawbacks:

  • As with the previous alternative: NaN is the widest bound possible, so adding NaNs to min and max isn't very useful, as it makes pruning for non-NaN values almost impossible in the respective direction. If we even allow -NaN and +NaN, a page containing both would have no meaningful min and max and wouldn't allow any pruning at all.
  • Most systems don't support -NaN, as mathematically speaking, it is nonsense. The only reason why it exists is that the physical reprsentation of floats has a sign bit that also exists for NaN representations.
  • The fact that NaNs being so unuseful for min/max bounds is displayed by the fact that even though Iceberg has such a well defined sort order, they still do what is proposed in this proposal and do not include -NaN/NaN into min/max bounds and rather track them through nan_counts.

All in all, any alternative putting NaN into min/max bounds (except for only-NaN-pages) suffers from the problem that NaN bounds are too wide and therefore not useful for pruning.

Writing a second value_counts list in the ColumnIndex

The column index does allow byte[0] values already, in case a page contains only nulls. We could enable the same for only-NaN pages by not storing only the nan_counts, but also the value_counts of a page. Then, one could check whether a page in the column index contains only NaNs by checking nan_count + nulls_count = value_count. Hoewever, this would mean yet another list in the column index, making the column index bigger and more expensive to deserialize. And while the nan_counts list only exists for FLOAT/DOUBLE columns, the value_counts list would exist for all columns and therefore take up considerably more space. Also, this would not be backward compatible, as an older reader wouldn't know of the new lists, so it would see a byte[0] and would need to treat it as invalid.

All in all, the extra list doesn't seem to add enough value for its cost (especially also cost on non-float columns) and reduced backward compatibility.

Do nothing

As long as we don't do anything, columns containing NaNs will almost be useless for scan pruning. The problems outlined will persist, making double columns almost unprunable for some predicates. That is not satisfactory.

Why wasn't sort order tackled?

Even with this improvement which fixes the semantics of NaN values in statistics, NaN values are still a problem in some other places as there is still not a defined order for them, so the boundary_order in a column index and the SortingColumn would still have undefined placement for NaN.

This mainly wasn't tackled for two reasons:

  • It is an orthogonal issue. This improvement is about enabling NaNs in statistics, so after this change all statistics can handle NaN in a well-defined way. Sort odering of columns to leverage boundary_order or SortingColumn needs to be solved in a different way anyway, as the mere information about whether (only or some) NaNs are present isn't enough here but it needs to be defined whether they come before or after all values.
    • Even though we could fix both statistics and sort order by just defining NaN to be smaller or greater than other values, doing so for statistics is not a good idea, as having NaN in bounds makes too wide bounds that aren't helpful for filtering.
    • If sort ordering will be fixed by a different commit one day, the design of this commit shouldn't have a (negative) influence on that future design, as NaN counts and not including NaNs into bounds is a good thing to do anyway.
  • While fixing statistics with NaN counts is pretty uncontested w.r.t. design alternatives (see the respective section for a discussion why), the design to be chosen for sort orders isn't that clear:
    • We could define a new ColumnOrder with well defined NaN ordering. This would be the cleanest fix, but also a "very big gun", as this would be the first non-default column order in existence. Also, we would have to decide whether we want to have different sort orders for nans first, nans last, and +/-nan allowed.
    • We could define a nans_first field which tells whether NaNs are to be sorted before or after other values, akin to the already existing field nulls_first. This would be a more micro-invasive change, but it would be less clean IMHO, as there is a tool for defining column ordering--the ColumnOrder--and not using that tool but working around it feels hacky.

Thus, sort ordering of NaNs wasn't tackled in this commit. They can be tackled by a follow-up change if necessary.

Epilogue

Make sure you have checked all steps below.

Jira

  • [x] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
    • https://issues.apache.org/jira/browse/PARQUET-XXX
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Commits

  • [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • [x] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

JFinis avatar Mar 22 '23 11:03 JFinis

Thus, to solve the problem of only-NaN pages, the comments in the spec are extended to mandate the following behavior:

Once a writer writes the nan_count/nan_counts fields, they have to: never write NaN into min/max if there are non-NaN non-Null values and always write min=max=NaN if the only non-null values in a page are NaN A reader observing that nan_count/nan_counts field was written can then rely on that if min or max are NaN, then both have to be NaN and this means that the only non-NULL values are NaN.

Instead of writing min and max as NaN when there are only NaN values and then let the reader to check whether min max NaN are credible by evaluating whether naNCounts is empty, wouldn't it be much simpler if we just left the evaluation of isNaN and notNaN to the reader? A reader can always conclude a page / column is all NaN when value count of the field == NaN count of the filed (when valueCounts and naNCounts both exists), this's Iceberg's current way of evaluating isNaN. Is there anything wrong with doing this in Parquet?

zhongyujiang avatar Mar 23 '23 13:03 zhongyujiang

@zhongyujiang (as I can't answer your comment directly). Here is the problem with your suggestion of checking nanCount == valueCount for checking for only NaNs:

@mapleFU To your general comment (I can't answer there)

The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by null_count + nan_count == num_values?

The problem is that the ColumnIndex does not have the num_values field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index a num_values list. Then this would indeed work everywhere but at the cost of an additional list.

So I see we have the following options:

  • Do what I did here, i.e., use min/max to determine whether there are only NaNs
  • Add a num_values list to the ColumnIndex
  • Accept the fact that the column index cannot detect only-NaN pages (might lead to fishy semantics)
  • Tell readers to use the min==max==NaN reasoning only in the column index, and use the null_count + nan_count == num_values for the statistics.

Which one would you suggest here?

JFinis avatar Mar 23 '23 15:03 JFinis

@JFinis Thanks for your reply, just realized that the page value count is stored in the page header, not in the column index. I overlooked your comments above before asked the question, sorry for that.

zhongyujiang avatar Mar 24 '23 01:03 zhongyujiang

The gist of all opened issues is the question how to encode pages/column chunks that contain only NaNs.

This is actually only an issue for the ColumnIndex. For statistics in the ColumnMetaData or the page, we can find only-Nan pages/columnChunks by computing num_values - null_count - nan_count == 0. The ColumnIndex doesn't have num_values, so we can't perform this computation.

I see three alternatives to handle the problem in the ColumnIndex:

  1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
  2. Adding num_values to the ColumnIndex, to make it symmetric with Statistics in pages & ColumnMetaData and to enable the computation num_values - null_count - nan_count == 0
  3. Adding a nan_pages bool list to the column index, which indicates whether a page contains only NaNs

I'm fine with either of these, so I would like us to reach a consensus for one of the alternatives here; then I can update my PR to reflect the decision. As this is my first contribution to parquet, I don't know the decision processes here. Do we vote? Is there a single or group of decision makers? Please let me know how to come to a conclusion here.

As a help for the decision: Here are again the PROs and CONs of the three alternativs:

  1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
  • PRO: Fully backward compatible
  • PRO: Needs no further lists in the ColumnIndex
  • CON: people are uneasy with storing NaNs in bounds, due to many existing bit patterns and therefore a bit fuzzy semantics.
  1. Adding num_values to the ColumnIndex, to make it symmetric with Statistics in pages & ColumnMetaData and to enable the computation num_values - null_count - nan_count == 0
  • PRO: No NaNs in bounds, no encoding/bit-pattern fuzzyness
  • PRO: Makes the ColumnIndex symmetric to other statistics (and to Apache Iceberg)
  • PRO: The num_values would also be viable for other purposes. It feels weirdly asymmetric to not have this field in the column index. For example, this would help to gauge the number of nested values in a nested column.
  • CON: The extra num_values list would be in each column index, even for non FLOAT/DOUBLE columns, thereby adding space consumption and encoding/decoding overhead.
  • CON: Would make null_pages redundant, as null_pages[i] == (num_values[i] - null_count[i] == 0)
  • CON: In theory not 100% backward compatible, but probably not relevant in practice*
  1. Adding a nan_pages bool list to the column index, which indicates whether a page contains only NaNs
  • PRO: No NaN encoding fuzzyness, no encoding/bit-pattern fuzzyness
  • PRO: Less space consumption than num_values. The list would only be present for FLOAT/DOUBLE columns
  • PRO: Along the lines of null_pages so following an existing pattern in the column index
  • CON: In theory not 100% backward compatible, but probably not relevant in practice*

* Explanation of "in theory not 100% backward compatible": Today, min and max in a column index have to have a valid value unless null_pages of the respective page is true. This would no longer hold if we decide to encode only-NaN pages through empty min/max + nan_pages or empty min/max + num_values. Thus, a legacy reader, who doesn't know the new lists, could come to the conclusion that the missing bounds constitute an invalid ColumnIndex and therefore might deem the whole Parquet file as invalid. I doubt that this is a problem in practice, as readers are written leniently. I.e., if a missing bound in a column index is encountered, the index might not be used (what would already happen today in case of an only-NaN page, so not a regression) or just that page might be treated as "has to be scanned". I don't know a reader that would reject the whole Parquet file in this case. Therefore, this is likely not relevant in practice.

JFinis avatar Mar 28 '23 08:03 JFinis

Thank you, @JFinis, for working on this. This is not an easy topic. I am afraid we cannot avoid encoding NaN values into column index min/max lists for the sake of backward compatibility: There is no such thing as "missing value" in the list. We encode actual primitive values. We need to store there something for each page. That's why we have null_pages to highlight that the values encoded for the corresponding page are valid or not. The only way I can think of being backward compatible is to store NaN values in min/max otherwise we mix up older readers.

gszadovszky avatar Mar 31 '23 13:03 gszadovszky

@JFinis Do you have a plan to revive this?

wgtmac avatar May 11 '23 02:05 wgtmac

I finally have time to continue on this. Sorry for the long wait.

As @gszadovszky has highlighted, we have to store a valid double/float value into the min/max bounds of the column index to be compatible with legacy readers. So the initial proposal to write NaN into min/max in this case would actually work.

But so far not everyone was happy with using these NaNs in readers to see whether we have an only-nan page. Therefore, the suggestion was to also add nan_pages to the column options (favored by @wgtmac and @mapleFU). I have updated the PR to this suggestion: We still would write NaNs into min/max in the column index if a page has only NaNs but advise the reader to not use these values (as readers are already advised today) and instead only use nan_pages to check for only-nan pages. This way, we don't need to worry about the semantics of NaN comparisions and readers can continue to ignore all NaN values they find in bounds.

For all other statistics, we would never (as before) write NaNs to bounds and instead now advice readers to use num_values == null_count + nan_count to detect only-NaN pages.

I have not updated the PR description yet to reflect this new design; only the files themselves have been updated. @wgtmac @mapleFU @gszadovszky Please review and let me know if you agree with this design. Then I will update the PR description accordingly.

JFinis avatar Jun 12 '23 10:06 JFinis

Let's cc some of the maintainers of parquet-rs: @adamgs @tustvold @alamb

pitrou avatar Jun 28 '23 15:06 pitrou

https://github.com/apache/parquet-format/pull/196#discussion_r1237381221 @alamb @tustvold Hi, I wonder for PageIndex pruning in Rust implementions, would it matter for adding [-inf, +inf] as min-max for all nan and null pages? Would it harm the column pruning for IS_NAN or other operations? Since I'm afraid this might break the current rust implementions, would you mind take a look?

mapleFU avatar Jun 30 '23 09:06 mapleFU

I wonder for PageIndex pruning in Rust implementions

Currently the arrow-rs implementation uses the totalOrder predicate as defined by the IEEE 754 (2008 revision) floating point standard to order floats, this can be very efficiently implemented using some bit-twiddling and at least appears to define the standardised way to handle this. I believe DataFusion is using these same comparison kernels for evaluating pruning predicates, and so I would expect it to have similar behaviour with regards to NaNs.

From the Rust docs:

The values are ordered in the following sequence:

negative quiet NaN
negative signaling NaN
negative infinity
negative numbers
negative subnormal numbers
negative zero
positive zero
positive subnormal numbers
positive numbers
positive infinity
positive signaling NaN
positive quiet NaN.

would it matter for adding [-inf, +inf] as min-max for all nan and null pages

I haven't read the full backscroll, but the original PR's suggestion of just writing a NaN for a page only containing NaN seems perfectly logical to me, unlikely to cause compatibility issues, and significantly less surprising than writing a value that doesn't actually appear in the data...

Let's cc some of the maintainers of parquet-rs:

I don't really know enough about the history of floating point comparison to weigh in on what the best solution is with any degree of authority, however, my 2 cents is that the totalOrder predicate is the standardised way to handle this.

Whilst I do agree that the behaviour of aggregate statistics containing NaNs might be unfortunate for some workloads, I'm not sure that special casing them is beneficial. Aside from the non-trivial additional complexity associated with special-casing them, if you don't include NaNs in statistics it is unclear to me how you can push down a comparison predicate as you have no way to know if the page contains NaNs? Perhaps that is what this PR seeks to address, but I do wonder if the simple solution might be worth considering...

Also tagging @crepererum who may have further thoughts

tustvold avatar Jun 30 '23 10:06 tustvold

Currently the arrow-rs implementation uses the totalOrder predicate as defined by the IEEE 754 (2008 revision) floating point standard to order floats, this can be very efficiently implemented using some bit-twiddling and at least appears to define the standardised way to handle this.

So arrow-rs has a nice handing on float/double comparings, I guess we only need to consider that the new data will not broken by stale parquet-mr reader? @gszadovszky

mapleFU avatar Jun 30 '23 11:06 mapleFU

@mapleFU, as I've written before that's why we initiated ColumnOrder to make the format open to specify orderings. I don't know how the other implementations use this already. In the current parquet-mr (since we introduced ColumnOrder) there is a logic that drops any statistics if the defined column order is not known. So we can safely initiate a new one. We can say that if the min/max value would contain a NaN, then we would write the new IEEE_754 column order otherwise TYPE_ORDER. In this case we can simple skip the additional lists for marking all-NaN pages and write the NaN values into the statistics instead. The question is how older readers of the other implementations would handle an unknown ColumnOrder. It is an implementation detail that the NaN handling is java is different from what IEEE 754 says. Java has only one NaN bitmap. So handling this ordering will require additional work. I hope it can be implemented in a performant way.

gszadovszky avatar Jun 30 '23 11:06 gszadovszky

I agree w/ @tustvold's standpoint. Some thoughts on top of what he wrote:

IMHO this is leaking application details into the storage format. If you start to differentiate NaN from "all normal values" and NULL you may do the same for +/-Inf, because it also acts as a poison value in most computations. But you may also do that for "nearly Inf" because someone divided by "nearly zero" and these super big values are equally nonsensical. This whole discussion isn't even specific to floats. Why do boolean stats not count true/false separately? What about empty strings and byte arrays? Or empty lists in general? My point is: this is opening a can of worms and the complexity isn't worth the gain.

The better alternative is: let the user cast invalid values to NULL if they wanna exclude them from their data, because this is exactly what missing values were invented for. If they still want to store broken data and want to have some niche understanding of statistics, provide a way to attach application-defined stats to parquet (this extends to a number of histogram types or counts of other "special" values). Keep the storage format baseline simple. IEEE total ordering is well defined and universally agreed upon. I think the world doesn't need yet another special floating point treatment.

crepererum avatar Jul 07 '23 10:07 crepererum

I think we already have type-defined order, and already exclude +inf and -inf. And not when if a page is all NaN, the page would be excluded

mapleFU avatar Jul 07 '23 11:07 mapleFU

I think we already have type-defined order

Indeed, and what I am suggesting is rather than layering on more complexity to workaround the problems of such an approach, how about we just remove this complexity? Perhaps defining a new ColumnOrder if necessary for backwards compatibility, although I am personally not sure this would even be necessary, as I suspect most readers already handle NaNs in some capacity even if they just ignore them.

tustvold avatar Jul 07 '23 11:07 tustvold

Okay, [-NaN, +NaN] as min-max would be ignored in C++ Statistics. I'm ok for these solutions.

mapleFU avatar Jul 07 '23 12:07 mapleFU

@tustvold @crepererum Do I interpret your answer correctly in that your suggestion would be to

  • Create a new ColumnOrder for floats that simply is defined as IEEE 754 total order, if we need such new order for backward compatibility (which we probably need, as apparently parquet-mr will otherwise perform filtering incorrectly)
  • When that order is used, don't handle NaNs explicitly. Instead, just use the total order relation for ordering and min/max computation (which will result in NaNs being written as max and -NaNs being written as min if they exist).

Did I get this right?

I guess this can also be implemented in each language by "bit casting" the float bits to integer bits and doing an integer comparison, correct? So even if the underlying language doesn't have native support for total ordering, it should still be possible to implement this.

I do see a certain beauty in this approach in it being "simple". As always, I'm happy to adapt my PR to this approach, if we can get consensus that we want this.

@mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?

JFinis avatar Jul 07 '23 12:07 JFinis

I guess this can also be implemented in each language by "bit casting" the float bits to integer bits and doing an integer comparison, correct

Its a bit more than a simple bit cast, but broadly speaking yes.

https://doc.rust-lang.org/src/core/num/f64.rs.html#1336

tustvold avatar Jul 07 '23 13:07 tustvold

The idea looks ok. I've check arrow parquet's reader implemention.

For statistics:

  1. If ColumnOrder is TypeDefinedOrder, uses min_value and max_value. But when it find it's a -Nan, +Nan pair, it would not set min-max, so, it's ok.
  2. Otherwise, uses min and max

For ColumnIndex, it doesn't check the cases for NaN, so it will believe the value in ColumnIndex. However, PageIndex is not widely used. So I think it's ok to support a order, and write [-Nan, +Nan]

Also cc @pitrou @wgtmac

mapleFU avatar Jul 07 '23 13:07 mapleFU

@mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?

It's difficult to say without understanding the implications. Say a data page contains some NaNs, what happens?

pitrou avatar Jul 07 '23 14:07 pitrou

@mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?

It's difficult to say without understanding the implications. Say a data page contains some NaNs, what happens?

@pitrou On the write path:

  • The writing library would set the ColumnOrder for this column to the new option, let's call it IEEE754TotalOrder.
  • The writing library would use IEEE754 total order for all order / sorting related tasks. I.e., it would compute the min and max values of the page using that total order. That order has a defined place for NaN. The writer would not have special logic for NaN. It would just order everything using total order. E.g., in case of a page containing a positive NaN, this would be chosen as the max value, as Nan > everything else in the total order.

On the read path:

  • A reading engine encountering the new IEEE754TotalOrder column order would either a) (legacy reader) not understand it and in this case not use any statistics, as it doesn't understand the semantics of the order relation. b) (new reader) understand it and assume that all order is in IEEE 754 total order, which again has a defined place for NaNs. Depending on the NaN semantics of the reading engine, it would need to make sure to align the values it sees in min/max with its own semantics. How this alignment would look like would depend on the semantics of the engine. (I can give more detailed examples for different engine semantics if necessary)

Ramifications:

  • PRO: Due to the new column order, legacy readers are guarded. They don't need to understand the new order. Even if they ignore the column order, if they see NaNs in min and max they will just ignore them and assume the statistics aren't usable. So we have two layers of protection to make sure legacy readers don't misunderstand the ordering.
  • PRO: No special fields for NaNs. No nan_counts, no nan_pages. Instead, NaNs are just treated as defined in the total ordering.
  • PRO: Simple standardized handling of floatsinstead of special handling of NaNs. I guess this was the main point of @tustvold and @crepererum.
  • PRO: Engines only need to understand total ordering and don't need any special NaN handling code anymore (unless their semantics is different, in which case they need to map their semantics from / to total ordering).
  • CON: NaNs will be used in min/max bounds, even for not only-NaN pages. This makes them less effective for filtering (as they are the widest possible bounds) but @crepererum made a good point that this "special case for NaN" is quite arbitrary and we could also special case INT_MAX for integer columns, e.g.. I see the argument that keeping the architecture simple might be preferrable. Also NaNs are not widely used, so this will not be determinental to many data sets.

JFinis avatar Jul 07 '23 14:07 JFinis

@JFinis Thanks a lot! I agree that makes sense. The main problem IMHO is that old readers wouldn't support page filtering on such files. That said, we have to move forward somehow.

pitrou avatar Jul 07 '23 15:07 pitrou

CON: NaNs will be used in min/max bounds, even for not only-NaN pages. This makes them less effective for filtering (as they are the widest possible bounds) but @crepererum made a good point that this "special case for NaN" is quite arbitrary and we could also special case INT_MAX for integer columns, e.g.. I see the argument that keeping the architecture simple might be preferrable. Also NaNs are not widely used, so this will not be determinental to many data sets.

I agree this is a con. Total ordering is nice if the goal is to order the data. If the goal is to filter the data then I think any consideration of NaN/null/infinity is meaningless.

However, I also agree with @crepererum that this is a slippery slope and I agree with @JFinis that NaNs are not widely used and simpler is better. I don't entirely agree the solution can always be to replace NaN/Infinity with NULL but the cases where it can't are probably very rare. Besides, the penalty here is only a performance loss and not incorrect results so it's more manageable.

So, on the balance, I'd say I'm neutral. If there are other advantages to this approach then the disadvantages to dataset filtering are probably not enough outweigh them. We might want to add a small sentence to some kind of pyarrow or parquet documentation somewhere so that users can be aware of this.

westonpace avatar Jul 07 '23 16:07 westonpace

The new IEEE754TotalOrder looks elegant to me, though a single NaN value may still ruin the page index. Another challenge is how parquet-mr can efficiently implement this sort order.

wgtmac avatar Jul 08 '23 15:07 wgtmac

To support old readers with the statistics we can choose to write TypeDefinedOrder for FP values in case there are no NaN values in the data.

gszadovszky avatar Jul 10 '23 07:07 gszadovszky

Just coming back to this as it has come up a bit downstream, the approach described in https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697 makes a lot of sense to me. Would it help move this forward if I were to raise a separate PR proposing it?

parquet-mr can efficiently implement this sort order

Provided Java provides some mechanism to interpret a float as an integer, it is just a case of some bit operations - https://doc.rust-lang.org/src/core/num/f64.rs.html#1336

Total ordering is nice if the goal is to order the data If the goal is to filter the data then I think any consideration of NaN/null/infinity is meaningless

Why would filter predicates not also need a well-defined order? FWIW arrow-rs uses total order for all floating point comparison.

tustvold avatar Nov 09 '23 11:11 tustvold

@tustvold I actually already have the change ready in my local repo. I was just distracted by other work and it seemed there was little interest so far in advancing this quickly, so I didn't update it on github, yet. I can update the PR tomorrow :).

JFinis avatar Nov 09 '23 18:11 JFinis

I hate to not stick to my word, but I won't be able to create the PR today, as my wife is going into labor and I'll have to drive her to the clinic soon 😅.

I pushed the status I have so far to my fork. You can already have a look if you want: https://github.com/jfinis/parquet-format/tree/totalorder

The commit is basically done, I just wanted to proof read everything and write a descriptive message for the commit and the PR. I'll find some time once we're back from the hospital, i.e., in a few days. But for now, I first need to deliver something else 👶 .

JFinis avatar Nov 10 '23 07:11 JFinis

Congratulations! Take all the time you need, there is no urgency on this from my end, just wanted to avoid things stalling out

tustvold avatar Nov 10 '23 12:11 tustvold

Okay, finally done. As the new solution (total order) does not share a single line with the current solution and this PR gets quite long and contrived, I created a new PR: https://github.com/apache/parquet-format/pull/221

I hope this is fine. If you rather want me to continue in this PR, let me know, then I'll close the other one and instead update this one. Otherwise, let's continue the discussion about total order in the new PR :).

@tustvold @mapleFU @wgtmac @crepererum @etseidl @gszadovszky @pitrou FYI

JFinis avatar Nov 22 '23 19:11 JFinis