parquet-format PARQUET-758: Add Float16/Half-float logical type

In the Mailing List, I proposed the addition of a Half Float (float16) type in Parquet: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5

This type is becoming increasingly popular in Machine Learning, and there are a bundle of issues requesting its support in Parquet: https://issues.apache.org/jira/browse/PARQUET-1647 https://issues.apache.org/jira/browse/PARQUET-758 https://issues.apache.org/jira/browse/ARROW-17464 https://github.com/apache/arrow/issues/2691

This is my first logical type proposal! I followed this PR as inspiration, but really welcome feedback from the community.

Make sure you have checked all steps below.

Jira

[X] My PR addresses the following Parquet Jira 1 and 2 issues and references them in the PR title.

Commits

[x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

[X] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

Aug 26 '22 21:08 anjakefala

@anjakefala You need to add to the LogicalType union, not to the Type enum (which is for physical types).

Also cc @emkornfield

Aug 29 '22 09:08 pitrou

We should probably specify that using the Byte Split Encodings can be used for this type as well?

Also, in general, if possible try to avoid force pushing, as it makes it harder to compare iterative changes (this might not be the style in this repo, though so if you found instructions elsewhere on this, please ignore).

Aug 30 '22 06:08 emkornfield

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

Aug 30 '22 06:08 emkornfield

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical vale (a Binary with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers are still an open issue: PARQUET-1222)

Aug 30 '22 07:08 gszadovszky

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (npy_half is just an unsigned 16-bit integer in this context)

Aug 30 '22 08:08 pitrou

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (npy_half is just an unsigned 16-bit integer in this context)

It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See PARQUET-1222 for details.

Aug 30 '22 08:08 gszadovszky

t is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See PARQUET-1222 for details.

I think these are orthogonal. I might be missing something but it seems like it would not be to hard to case float16 to float in java/cpp and do the comparison in that space and cast it back down. This might not be the most efficient implementation but would be straightforward? I am probably missing something. It would be nice to resolve PARQUET-1222 so the same semantics would apply to all floating point numbers.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics).

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

Aug 31 '22 03:08 emkornfield

I've came up with this ordering thing because we specify it for every logical types. (Unfortunately we don't do this for primitive types.) Therefore, I would expect to have the order specified for this new logical type as well which is not trivial and requires to solve PARQUET-1222. At least we should add a note about this issue.

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

I do not have the proper environment to test it but based on the code we do not handle unknown logical types well in parquet-mr. I think it handles unknown logical types as if they were not there at all which is fine from the data point of view but we would blindly use the statistics which may cause data loss. Created PARQUET-2182 to track this.

Aug 31 '22 05:08 gszadovszky

I think Parquet C++ probably has the same issue. Let me reread PARQUET-1222. to see what the current state is and if we can push it forward.

Sep 04 '22 05:09 emkornfield

I agree with @emkornfield that ordering issues seem largely orthogonal, as they also affect FLOAT32 and FLOAT64 types.

Sep 07 '22 11:09 pitrou

@pitrou @emkornfield @gszadovszky

Is there anything I can do to move this addition forward? Can I help with any code?

In terms of design, my understanding from reading the comments is that @gszadovszky brought up an ordering concern (valid, but not a blocker?), and that a decision needs to be made on whether float16 would be implemented as a logical or physical type?

Sep 29 '22 22:09 anjakefala

Sorry for the delay, it sounds like PARQUET-1222 is blocker, let me make a proposal there and see if we can at least come to consensus on approach and maybe this feature can be the first test-case for it.

Sep 30 '22 05:09 emkornfield

Sorry for the delay but PARQUET-1222 has now been merged, so I think this is unblocked.

Dec 07 '22 17:12 emkornfield

Thanks so much for the update @emkornfield!

What is the next step I can take?

Dec 07 '22 22:12 anjakefala

@anjakefala IIUC, I think the prior objection was around not properly floating point sorting for statistics correctly. I think the main thing is to update the specification to say this requires the same sorting logic as float32 and float64. I need to rereview the current state of things, but then I think we can probably try to vote on the mailing list to see if this type is acceptable. I'm not sure on the exact process here (I don't know if implementations are required before a vote). @gszadovszky thoughts?

Dec 07 '22 22:12 emkornfield

Thank you @emkornfield! I added the sort order to the spec.

Dec 14 '22 22:12 anjakefala

Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing list for a vote? It seems @gszadovszky is not available for insight; is there anyone else that can provide it?

Jan 09 '23 19:01 anjakefala

@shangxinli are there guidelines for what needs to happen to accept this addition?

Feb 01 '23 17:02 emkornfield

@shangxinli are there guidelines for what needs to happen to accept this addition?

I suppose it needs a discussion and then a formal vote on the ML?

Feb 01 '23 17:02 pitrou

As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementation ready before we merge this PR. We would like to have implementation support when the new type is released.

Feb 01 '23 18:02 shangxinli

As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementation ready before we merge this PR. We would like to have implementation support when the new type is released.

It might have missed it but I didn't see Julien's reply on the dev mailing list. This seems reasonable though.

Feb 13 '23 07:02 emkornfield

It might have missed it but I didn't see Julien's reply on the dev mailing list. This seems reasonable though.

For full disclosure, it was a discussion involving the Parquet PMC and myself after I messaged the PMC to inquire about the way forward on this proposal.

Feb 13 '23 11:02 pitrou

@emkornfield apologies for this, I realize I replied to a thread on the private list. @pitrou emailed private@ to get more eyes on this. Which worked. But yes, this discussion didn't need to be private. I replied: "I would recommend having the corresponding PRs for support in the Java and C++ implementation ready as well for this to be merged so that we don't release a new type that does not have support in the implementation."

Feb 13 '23 17:02 julienledem

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical vale (a Binary with two bytes in this case). So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers are still an open issue: PARQUET-1222)

FWIW, I rather think it should be a physical type for the following reasons:

encodings are currently only defined on the physical type, not the logical one. So allowing BYTE_STREAM_SPLIT for this type would actually break this if it is a logical type.
Having this be a logical type while float and double are physical types seems inconsistent.
There might eventually be hardware support or native language support for this for this type. In this case, having it as physical type would allow easier to leverage this hardware / language support, as most libraries instantiate encoders/decoders based on the physical type. Again, having now one exception where you would need a decoder based on a logical type would break this pattern and require additional effort. If Java and C++ had a float16 type, I guess more people would agree that it should be a physical type. So is the intuition of this being a logical type just based on the yet missing language support for this?
IMHO, the basic idea behind physical and logical types is not to support forward compatibility; that is just a byproduct. Otherwise, there should just be one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY). The basic idea is rather to make a distinction between physical representation and what the values logically mean. In my mental model it is rather a layered approach: There are layers that only care about the physical types (e.g., the encoders/decoders) and then further layers that also care about the logical type (e.g. the statistics maintenance code). And here again, this would break this layering.

Jun 12 '23 13:06 JFinis

FWIW, I rather think it should be a physical type for the following reasons:

encodings are currently only defined on the physical type, not the logical one. So allowing BYTE_STREAM_SPLIT for this type would actually break this if it is a logical type.

This seems reasonable, but BYTE_STREAM_SPLIT is the only relevant encoding here (and it's probably not widely used yet).

Having this be a logical type while float and double are physical types seems inconsistent.

While I agree it seems inconsistent, this is an idealistic argument. If Float16 had been included from scratch, it would be logical ( :-) ) to make it a physical type because there would not be any compatibility problem.

There might eventually be hardware support or native language support for this for this type. In this case, having it as physical type would allow easier to leverage this hardware / language support, as most libraries instantiate encoders/decoders based on the physical type.

I'm not sure by how much making Float16 a physical type would make encoding/decoding easier. The main change would be that bytewidth is known at compile time, instead of at runtime. Other than that, having HW support for Float16 would not change the ease of copying data two bytes at a time...

IMHO, the basic idea behind physical and logical types is not to support forward compatibility; that is just a byproduct. Otherwise, there should just be one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY).

That's partly true, but once the Parquet format is widely used, forward compatibility becomes a significant concern.

If Float16 is a new physical type, existing systems will probably not be able to read data with such a column at all. If Float16 is a new logical type, existing systems should be able to read the data; they just won't be able to draw any insight from the Float16 columns (but will be able to process the other columns).

To sum it up:

making Float16 a logical type would make adoption of the new type faster as compatibility with existing systems would be preserved
making Float16 a physical type would be more consistent overall, and would allow applying the BYTE_STREAM_SPLIT encoding

Jun 14 '23 15:06 pitrou

I think that compatibility in Parquet file format is such a strong requirement that extending primitive types is simply not an option. (I agree though, if I would introduce a new file format I would specify only 3 primitive types: FIXED_LEN, VAR_LEN, BIT. But we already have what we have.) Meanwhile, I don't see why the Float16 physical type would be a requirement to use BYTE_STREAM_SPLIT. I don't think it is widely used so we can update the related spec to allow this encoding to be used for any primitive type (except BOOLEAN). Then, it is up to the implementations to use it for FIXED_LEN_BYTE_ARRAY[2] (FLOAT16).

Jun 15 '23 06:06 gszadovszky

PR for C++ float16 implementation in Parquet: https://github.com/apache/arrow/pull/36073

Jun 15 '23 15:06 anjakefala

It seems that both the Java implementation and the C++ implementation are in a state of readiness.

Has the vote started? Can anyone with visibility update me on the status?

Oct 04 '23 17:10 anjakefala

@anjakefala Agreed that everything seems to be in place. I'll be starting the vote on the ML later today.

Oct 04 '23 17:10 benibus

@pitrou @emkornfield @gszadovszky @JFinis @julienledem @shangxinli

The vote has been started by @benibus here: https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871 Please chime in! I would also appreciate anyone forwarding the vote to the private listserv.

Oct 05 '23 17:10 anjakefala

parquet-format parquet-format copied to clipboard

PARQUET-758: Add Float16/Half-float logical type

Jira

Commits

Documentation

parquet-format
parquet-format copied to clipboard