parquet-format icon indicating copy to clipboard operation
parquet-format copied to clipboard

PARQUET-756: Add Union Logical type

Open julienledem opened this issue 8 years ago • 28 comments

julienledem avatar Oct 26 '16 18:10 julienledem

@rdblue updated

julienledem avatar Oct 28 '16 01:10 julienledem

@rdblue I have updated the spec per your comments.

julienledem avatar Nov 01 '16 23:11 julienledem

I think we might need more discussion of how we want projects into unions to work.

I don't think all object models can return an "empty" union (eg thrift). Additionally, because we don't store group data, only data for each primitive column, in practice in order to do projections into unions you have to do things like pick an arbitrary child field of the union and use it's definition levels to figure out if the union "wing" is null or not. We've got a lot of these workarounds in the thrift implementation, but it'd be good to iron out how we really want that to work.

Another option would be to add a push down filter for any "branch" of a union that hasn't been selected in a projection.

isnotinvain avatar Nov 03 '16 07:11 isnotinvain

@isnotinvain, is the solution to the empty union simply that thrift can't project a subset of the union's branches?

I don't really like the idea of filtering rows that don't have non-null values, but I do see the problem. What about requiring the repetition of a union to be optional when projecting a subset of the branches?

rdblue avatar Nov 03 '16 16:11 rdblue

I'm not sure if I'm being clear, let me use an example of the issues we've seen with thrift union support.

  1. Selecting only columns from 1 "wing" of a union Lets say we have schema:
union Animal {
  1: optional Dog dog
  2: optional Cat cat
  3: optional Turtle turtle
}

struct Dog {
  1: required string name
  2: required string bark
}

struct Cat {
  1: required string name
  2: required int numLives
}

struct Turtle {
  1: required string name
}

And now, the user uses projection to select only the columns animal.cat.numLives In any of the type safe implementations of parquet, we still have to return the user an instance of Animal. In the case of a record that is a dog, what should we return them?

One option is to return an 'empty' Dog. This has two issues:

  1. Dog's fields are required -- thrift doesn't really let you get away with not setting those fields
  • this is solved by either putting null-like values (0, null, false, etc) in those fields
  • or by throwing an exception on access
  1. When parquet encounters this record it is unknown whether this record is a Dog or a Turtle. All we know is that it's not a Cat (it's cat field is known to be null because animal.cat.numLives's definition levels tell us that. But we didn't load up any Dog/Turtle columns so we don't know if the Animal's optional dog / turtle fields are null or not (we don't store group level data, it's only in the primitive columns). So what we've done here in the past is pick at least one arbitrary column from each "wing" of the union so that we can look at the definition levels of all "wings"

Does that make sense?

isnotinvain avatar Nov 04 '16 03:11 isnotinvain

Oh, and, that's why filtering the "unknowns" (the is this a dog or a turtle case) away sort of makes sense? Like, the user didn't ask for any dog/turtle columns so I guess they don't want those records? (probably more confusing than helpful)

isnotinvain avatar Nov 04 '16 03:11 isnotinvain

One more thing I should probably clarify -- this would be pretty easy if thrift represented unions the way they look in the above schema definition, we could just return something like:

Animal x = new Animal()
x.dog = null
x.cat = null
x.turtle = null
return x

The problem in thrift is there will be no Animal concrete class. Just an interface called Animal and 3 subclasses. There is an "unknown" subclass -- but that is usually reserved for when we know, based on the data on disk, that we don'k know what kind of union this is (because our schema is older than the writer's schema). Turning a known record into an unknown based on projections seems incorrect too.

isnotinvain avatar Nov 04 '16 03:11 isnotinvain

When selecting Cat in your example, we shouldn't create an empty Dog or Turtle, but instead return null because Cat was null. That's why I think we may want to require the union to be optional when projecting some of its branches: we will need to fill in with null for branches that aren't projected. For Thrift -- and Avro when there is no NULL option -- I think the consequence is not allowing the user to project a subset of the union's branches because it violates the contract of the requested schema to return null.

It is better to fail early in this projection case than to fake the other branches. Like you said, we have a problem if we don't know which other union branch was non-null. We also can't add a filter. Say I want to know the percentage of PetOwners in California that have cats. Then I filter PetOwner on state = 'CA' and project pet.cat. The null values (not cats) are relevant.

In terms of the spec for a UNION logical type, I think we end up with this: "When projecting a proper subset of the union's branches, the union itself must be optional. The union value should be null for branches that are not projected."

rdblue avatar Nov 04 '16 16:11 rdblue

Where do we want to document this? I'd separate what is defined by the format and what is specific to each model integration. To me it sounds like we'd want to formalize the details in parquet-thrift and parquet-avro that have their own specificities in the object model (and their could be more than one way of doing it per model). In LogicalTypes.md I'd add the following statements:

  • To know if a union is null you need to keep at least one column from one the branches.
  • if you project out some branches of the union the type of the union will be "unknown" for those at read time. The way to expose this is dependent on the object model integration (avro, thrift, ...). To know that type you need to keep at least one column from each branch.
  • Filtering "unknown" things out is defined by the model as well. And add the details about Thrift and Avro in their respective directory (possibly link from here)

julienledem avatar Nov 04 '16 17:11 julienledem

I rebased and addressed @rdblue latest feedback before the union projection discussion. I'm planning to add the details I mention in https://github.com/apache/parquet-format/pull/44#issuecomment-258503184 pending feedback from @rdblue and @isnotinvain .

julienledem avatar Nov 04 '16 18:11 julienledem

Implications in the object model should go in each model. Avro should document that it can't project unless the requested schema includes a NULL branch. However, the overall requirement that Parquet won't project a subset of the branches unless the union group is optional should be documented in the spec since that's a general requirement for how Parquet projection works.

rdblue avatar Nov 04 '16 18:11 rdblue

@rdblue Agreed except for the last bit: "Parquet won't project a subset of the branches unless the union group is optional" I don't think it should. Users should be able to project whatever they want and we should not add artificial constraints like this. If the union is required then a missing value just means "in one of the branches that you projected out". This is a totally valid use case. Typically you could want to read all cats and filter out all other branches of the union and you don't care whether the rows you ignored were turtles or dogs.

julienledem avatar Nov 04 '16 18:11 julienledem

@rdblue but this constraint can be defined in avro and thrift. Just not as a global rule.

julienledem avatar Nov 04 '16 18:11 julienledem

Users should be able to project whatever they want and we should not add artificial constraints like this.

I think you're right. I wasn't thinking that it was an artificial constraint, but there's no reason for Parquet to require this because its structure is clear.

rdblue avatar Nov 04 '16 20:11 rdblue

I'm not at a computer now but I will reply tonight. We can't return NULL for an option union even, as that's not really correct -- that union wasn't null it was a present Cat/Turtle -- claiming it was null is not really accurate I think.

On Friday, November 4, 2016, Ryan Blue [email protected] wrote:

Users should be able to project whatever they want and we should not add artificial constraints like this.

I think you're right. I wasn't thinking that it was an artificial constraint, but there's no reason for Parquet to require this because its structure is clear.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/parquet-format/pull/44#issuecomment-258540785, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWw6SOg3g-JADRVXTFIqP1SDhqnFApGks5q65crgaJpZM4KhhQQ .

isnotinvain avatar Nov 04 '16 22:11 isnotinvain

I think it's for the object model to decide how to handle the projection, but that null is a reasonable option. The object model could expose the union as individual branches, or could create objects like Dog and Turtle with no fields, or have a "not project" sentinel. It's really up to the model implementers to determine. I think for Avro, we'll go with null and document that null will be returned when non-null branches aren't projected.

rdblue avatar Nov 04 '16 22:11 rdblue

@rdblue @isnotinvain I have updated the doc with more details about projecting unions. Could you create each a PR for the following?

  • @rdblue: avro rules
  • @isnotinvain: thrift rules

thank you.

julienledem avatar Nov 08 '16 23:11 julienledem

I still have some reservations about this. I think an expected contract (but maybe not an explicit contract) that we have is that changing your projection shouldn't change your results, only your efficiency?

Lets say a user's query is something like:

long dogs = 0;
long cats = 0;
long others = 0;
Animal a = parquetData.next();

if (a.isDog()) {
  dogs++;
} else if (a.isCat()) {
  cats++;
} else {
  others++;
}

If the users selects all the columns, they'll get an accurate count of dogs and cats, and others. If they select only some columns from Dog, then a.isCat() will return false, and that cat will get counted as an 'other'

This seems pretty surprising to me, and is why in the thrift integration we went through a lot of hoops to try to avoid it. I think if parquet is going to support unions as a first class concept, instead of pushing these complicated decisions to each object model, it'd be nice if parquet could handle this for us for all object models.

We could for example, write in parquet-core an efficient way to read on the definition levels of one child of each union branch, so that we can tell each object model which wing of the union an object belongs to. Then the object model can do w/e it wants with that info, but I think handeling this belongs in parquet-core ideally, and isn't super difficult nor inefficient (assuming we can read definition levels only).

isnotinvain avatar Nov 09 '16 00:11 isnotinvain

Another thing to consider about adding first class support for unions is efficiency. We've found that parquet pays a pretty high cost reading nulls, especially in large schemas made up of lots of unions (in this case, each record is fairly small, but the schema is very large because there are so may fields total, but each record only populates a small number of them)

If we take advantage of knowing about unions in parquet-core, the read state machine / record assembly could skip a huge amount of asking column readers "are you null?" when it knows that they will all be null due to the fact that they are children of a union and that only one of the branches will ever contain non-nulls. Does that make sense?

isnotinvain avatar Nov 09 '16 00:11 isnotinvain

@rdblue @julienledem We're looking to use union type in my company, and I found this JIRA. Wondering the status of this PR and why it's not merged in the end? Thanks!

dbtsai avatar Jan 25 '19 17:01 dbtsai

@dbtsai, the problem with Union is that its behavior isn't well-defined. It is difficult to decide what is correct, and support in processing engines is bad. I don't think it is a good idea to add them when you can instead get predictable behavior using optional objects. (Also, see the discussion on the Iceberg spec.)

rdblue avatar Jan 25 '19 18:01 rdblue

Hi @rdblue @julienledem, I'm working with @dbtsai on this feature. We are scoping out support of UNION through AVRO -> Spark -> Parquet and its interesting to read this spec and the concerns that come out of it.

  • Some concerns here seem to revolve around the user-understood semantics of querying fields in UNIONS; I'd suggest that these semantics may already be solidified via other paradigms; for instance a SparkSQL selector that selects fields from an optional nested type will return null if the parent type does not exist.
  • It seems a little chicken-and-egg in terms of tooling support; in relation to the above could these behaviors be codified in the spec? It seems to me that the starting point of adding UNION support in the tooling would be to ensure that the formats support it. In terms of Parquet and interop with other formats this behavior is already undefined somewhat (in that the conversion code decides the embedding of UNIONS from one format to another); moving this behavior into the spec/format I think would help to close the gap on this slightly undefined behavior.

Please let me know your thoughts!

matlarsen avatar Jan 26 '19 00:01 matlarsen

I'd be interested in having unions in the Parquet format. It would have to annotate a STRUCT having a first field that is an INT32 indicating which of the subsequent fields should be selected for each row in the dataset. Other fields can simply be null when they are unselected in the union

wesm avatar Feb 01 '19 03:02 wesm

I have not been active on this recently. If someone wants to push this to the finish line they should feel free to take over this PR

julienledem avatar Sep 18 '21 01:09 julienledem

What needs to be done to push it over the finish line?

codeinred avatar Jun 08 '23 19:06 codeinred