iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Spec: add variant type

Open aihuaxu opened this issue 1 year ago • 6 comments

Help: #10392

Spec: add variant type

Proposal: https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit

This is to layout the spec for variant type. The specs are placed in Parquet project (see variant spec and shredding spec.

aihuaxu avatar Jul 31 '24 23:07 aihuaxu

cc @rdblue, @RussellSpitzer and @flyrain

aihuaxu avatar Jul 31 '24 23:07 aihuaxu

I do want to make sure we don't do a hostile fork here of the spec from Spark so let's make sure we get support from them to move the spec here before we merge. At the same time we should start going through wordings and continue to discuss the specs. I still think that would be easier to do in a public Google Doc though than in Github IMHO.

RussellSpitzer avatar Aug 01 '24 15:08 RussellSpitzer

I do want to make sure we don't do a hostile fork here of the spec from Spark so let's make sure we get support from them to move the spec here before we merge. At the same time we should start going through wordings and continue to discuss the specs. I still think that would be easier to do in a public Google Doc though than in Github IMHO.

Definitely. It's not for merge yet. I'm mostly trying to get the comments in place. Make sense to move that to google doc and link here.

sfc-gh-aixu avatar Aug 01 '24 16:08 sfc-gh-aixu

This needs some notes in Partition Transforms , I think explicitly we should disallow identity

For Appendix B - We should define something or state explicitly we don't define it for variant.

Appendix C - We'll need some details on the JSON serialization since that's going to have to define some string representations I think

Under Sort Orders we should probably note you cannot sort on a Variant?

Appendix D: Single Value Serialzation needs an entry, we can probably right "Not SUpported" for now, Json needs an entry

RussellSpitzer avatar Oct 17 '24 18:10 RussellSpitzer

And an entry https://github.com/apache/iceberg/blob/main/format/spec.md#parquet

RussellSpitzer avatar Oct 17 '24 18:10 RussellSpitzer

This needs some notes in Partition Transforms , I think explicitly we should disallow identity

For Appendix B - We should define something or state explicitly we don't define it for variant.

Appendix C - We'll need some details on the JSON serialization since that's going to have to define some string representations I think

Under Sort Orders we should probably note you cannot sort on a Variant?

Appendix D: Single Value Serialzation needs an entry, we can probably right "Not SUpported" for now, Json needs an entry

Thanks @RussellSpitzer I missed those sections and just updated.

I mark Partition Transforms, sorting and hashing not supported/allowed for now. For Appendix C, I think it should be just variant, similar to primitive type, since it's Iceberg schema as I understand the section.

aihuaxu avatar Oct 18 '24 18:10 aihuaxu

@aihuaxu, I think there are a couple of things missing:

  • The Avro appendix should be updated to state that a Variant is stored as a Record with two fields, a required binary metadata and a required binary value. Shredding is not supported in Avro.
  • The ORC appendix should be updated to state that a Variant is stored as a struct with two fields, a required binary metadata and a required binary value. Type attribute should be iceberg.struct-type=variant. Shredding is not supported in ORC.

rdblue avatar Oct 24 '24 22:10 rdblue

Oops. I didn't mean to close this.

rdblue avatar Oct 24 '24 22:10 rdblue

@aihuaxu, I think there are a couple of things missing:

  • The Avro appendix should be updated to state that a Variant is stored as a Record with two fields, a required binary metadata and a required binary value. Shredding is not supported in Avro.
  • The ORC appendix should be updated to state that a Variant is stored as a struct with two fields, a required binary metadata and a required binary value. Type attribute should be iceberg.struct-type=variant. Shredding is not supported in ORC.

Thanks @rdblue I thought we will make changes when we start to work on Avro/ORC. I added that.

I don't have much context for Json conversion. Not sure if we need to add more info.

aihuaxu avatar Oct 25 '24 20:10 aihuaxu

I believe I have addressed the comments and can we move forward to merge the PR? Let me know if I miss anything. cc @RussellSpitzer and @rdblue

aihuaxu avatar Jan 07 '25 17:01 aihuaxu

Merged, thanks everyone for your thoughtful feedback. Thanks @aihuaxu for the pr. Thanks to @rdblue , @emkornfield , @findepi , @XBaith , and @flyrain for your contributions reviewing.

RussellSpitzer avatar Jan 29 '25 22:01 RussellSpitzer

Merged, thanks everyone for your thoughtful feedback. Thanks @aihuaxu for the pr. Thanks to @rdblue , @emkornfield , @findepi , @XBaith , and @flyrain for your contributions reviewing.

Thanks a lot for the discussion, and thanks for merging @RussellSpitzer .

sfc-gh-aixu avatar Jan 30 '25 00:01 sfc-gh-aixu