[EPIC] Support `VARIANT` type for unstructured data

Open alamb opened this issue 7 months ago • 7 comments

Is your feature request related to a problem or challenge?

Processing semi-structured data (basically think anything that can be represented in JSON) efficiently is becoming more and more important.

As @wjones127 says in https://github.com/apache/datafusion/issues/10987:

> This would be a high-performance data type for semi-structured data, designed for better OLAP performance than JSON or BSON (discussed in #7845).

While it is certainly possible to implement semi-structured, JSON, and even Variant support today using the DataFusion extension APIs (e.g. https://github.com/datafusion-contrib/datafusion-functions-json), this ticket tracks adding such support to DataFusion itself.

Parquet recently adopted the Variant type: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

We see adoption of this in other systems as well such as Iceberg and Spark.

  • https://github.com/apache/iceberg/issues/10392

I think Databricks did a good job describing its rationale:

  • https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark

> Without Variant, customers had to choose between flexibility and performance. To maintain flexibility, customers would store JSON in single columns as strings. To see better performance, customers would apply strict schematizing approaches with structs, which requires separate processes to maintain and update with schema changes. With Variant, customers can retain flexibility (there's no need to define an explicit schema) and receive vastly improved performance compared to querying the JSON as a string.

Describe the solution you'd like

No response

Describe alternatives you've considered

This will be a big project. Here are some of the related prerequisites:

  • [x] https://github.com/apache/arrow-rs/issues/6736
  • [ ] https://github.com/apache/arrow-rs/issues/8480
  • [x] https://github.com/apache/datafusion/issues/14993
  • [ ] https://github.com/apache/datafusion/issues/12644

It is not clear to me if variant should be "built in" or if it should be an add on (for example, add a variant feature and a datafusion-variant crate)

Additional context

Related tickets

  • https://github.com/apache/datafusion/issues/10987
  • https://github.com/apache/datafusion/discussions/15264
  • https://github.com/apache/datafusion/discussions/9103

alamb • May 20 '25 15:05

Just listing a few specific places where I've had to integrate extension types outside of existing DataFusion mechanisms:

  • The Signature (i.e., how do you use the Signature mechanism to match a UDF). One can also just use a kernel that matches anything and error in return_field() (and coerce the types yourself). If I'm remembering correctly, the internal logic to coerce existing types isn't exposed in a way that makes it easy to do that.
  • Casting: the Cast struct uses DataType (and anything that uses arrow-rs' casting will too). One can work around this by defining a UDF (e.g., custom_cast(x, to)) where to is a null scalar of the appropriate type.
  • Printing: the out-of-the-box output you'll get is probably not what should be printed when you query a Parquet file with a variant column. (Could/should re-use the cast to string?)
  • CSV output (Could/should re-use the cast to string?)
  • The Parquet reader outputting the extension field (I assume this is in the works/is a parquet crate issue?)
  • Multiple Parquet files/unioning: out of the box, a shredded and an unshredded variant will probably fail to concatenate because of the differing layout (and two differently shredded variants probably will too).
  • SQL parsing/unparsing (mostly expressing the type name)
  • Use of statistics to do pruning. I think DataFusion automatically disables Parquet pruning when it sees a column reference to a nested type like a struct (and its notion of a Column doesn't support nesting, and its notion of Statistics is somewhat limited), so there are possibly a few battles on this one.
  • Probably more!
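To make the concatenation bullet above concrete, here is a minimal Python sketch. It is not DataFusion or arrow-rs code. It models a batch as a dict of column lists and assumes a simplified picture of shredding in which each row carries either a `typed_value` or a fallback `value`, with JSON text standing in for Variant bytes. The helper names `concat` and `unshred` are hypothetical.

```python
import json


def concat(batches):
    """Concatenate batches, requiring identical column layouts (as a strict engine would)."""
    layout = sorted(batches[0])
    out = {name: [] for name in layout}
    for batch in batches:
        if sorted(batch) != layout:
            raise TypeError(f"layout mismatch: {sorted(batch)} vs {layout}")
        for name in layout:
            out[name].extend(batch[name])
    return out


def unshred(batch):
    """Fold the (hypothetical, simplified) shredded layout back into one `value` column."""
    if "typed_value" not in batch:
        return batch
    merged = [
        json.dumps(typed) if typed is not None else raw
        for typed, raw in zip(batch["typed_value"], batch["value"])
    ]
    return {"value": merged}


# File 1 is unshredded: every row is an opaque encoded value.
unshredded = {"value": ['{"a": 1}', '"hello"']}
# File 2 is shredded as an integer: rows that fit go to typed_value, the rest stay in value.
shredded = {"typed_value": [42, None], "value": [None, '"world"']}

try:
    concat([unshredded, shredded])
    failed = False
except TypeError:
    failed = True

# Normalizing both inputs to a common layout first makes the concatenation succeed.
combined = concat([unshred(unshredded), unshred(shredded)])
```

The same idea would presumably apply in reverse (shredding everything to a common schema), at the cost of rewriting the unshredded input.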

> It is not clear to me if variant should be "built in" or if it should be an add on (for example, add a variant feature and a datafusion-variant crate)

It's definitely easier to hard-code a type, although I think DataFusion will be better for allowing injected behaviour for more than just variant. Variant is definitely shinier, but UUIDs and geometry have many of the same problems. I'll put out a link to the vctrs R package ( https://vctrs.r-lib.org/articles/s3-vector.html ) which is a truly exceptional example of decentralized custom typing that supports custom printing, math, casting, and coercion for parameterized and unparameterized array types in R. Mostly this involves access to a registry of types in places that are currently stateless (can also be a static global variable like in Arrow C++ although I think the SessionContext is a better home).
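As a rough sketch of the registry idea, here is what a per-session lookup of injected behaviour could look like, written in Python for brevity. Everything here is hypothetical (`TypeRegistry`, `ExtensionBehavior`, and the single `display` hook); it is not an existing DataFusion API, and a real design would need more hooks for casting and coercion.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ExtensionBehavior:
    """Hooks a hypothetical engine would look up for one extension type."""
    display: Callable[[Any], str]


@dataclass
class TypeRegistry:
    """Per-session registry keyed by extension type name."""
    behaviors: Dict[str, ExtensionBehavior] = field(default_factory=dict)

    def register(self, name: str, behavior: ExtensionBehavior) -> None:
        self.behaviors[name] = behavior

    def display(self, name: str, value: Any) -> str:
        behavior = self.behaviors.get(name)
        # Fall back to the generic storage rendering when nothing is registered.
        return behavior.display(value) if behavior else repr(value)


registry = TypeRegistry()
# "arrow.uuid" is stored as 16 raw bytes; the injected hook formats it as a UUID string.
registry.register(
    "arrow.uuid",
    ExtensionBehavior(display=lambda raw: str(uuid.UUID(bytes=raw))),
)

raw = bytes(range(16))
shown = registry.display("arrow.uuid", raw)
fallback = registry.display("unregistered.type", raw)
```

Printing, CSV output, and cast-to-string could then all consult the same registry instead of each hard-coding a list of known types.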

paleolimbot • Jul 15 '25 14:07

A brief update here -- as of Arrow 57.0.0, I think there will be enough Variant support in the parquet crate to actually use it in DataFusion, and there is enough support for extension types, as described in Implementing User Defined Types and Custom Metadata in DataFusion.

I would suggest starting off with the basic input/output:

Add some UDFs for calling the arrow kernels:

  1. variant_to_json to convert a variant column to String (and JSON)
  2. json_to_variant: the opposite of the above
  3. variant_get: Extract subpaths
  4. cast_to_variant: casting columns to variant
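As a rough illustration of what these four functions would do, here is a toy Python model. It is not the arrow-rs kernels or the datafusion-variant API: a parsed Python object stands in for the binary Variant encoding, and the path syntax accepted by `variant_get` here (a list of keys and indexes) is an assumption made for the sketch.

```python
import json


def json_to_variant(text):
    """Parse JSON text once into a structured value (stand-in for Variant bytes)."""
    return json.loads(text)


def variant_to_json(variant):
    """Render a variant back to compact JSON text."""
    return json.dumps(variant, separators=(",", ":"))


def variant_get(variant, path):
    """Follow object keys / array indexes; return None when the path is absent."""
    current = variant
    for step in path:
        if isinstance(current, dict) and isinstance(step, str) and step in current:
            current = current[step]
        elif isinstance(current, list) and isinstance(step, int) and 0 <= step < len(current):
            current = current[step]
        else:
            return None
    return current


def cast_to_variant(column):
    """Wrap already-typed column values as variants (identity in this toy model)."""
    return list(column)


events = ['{"user":{"id":7,"tags":["a","b"]}}', '{"user":{"id":8}}']
variants = [json_to_variant(e) for e in events]

ids = [variant_get(v, ["user", "id"]) for v in variants]
second_tags = [variant_get(v, ["user", "tags", 1]) for v in variants]
round_trip = [variant_to_json(v) for v in variants]
from_ints = cast_to_variant([1, 2, 3])
```

The point of the model is that the JSON text is parsed once on the way in, so path extraction does not need to re-parse a string for every query.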

I don't think any of this code should be in the "core" datafusion crate, and we could make a datafusion-variant crate either in the main repo, or maybe even in datafusion-contrib

I am hoping that any features driven by variant (e.g. user defined casting for example) would result in APIs in the core that are implemented by the variant crate

alamb • Sep 26 '25 19:09

Update here is that I have a branch / PR with an upgrade to a pre-release version of arrow 57 that includes variant support:

  • https://github.com/apache/datafusion/pull/17888

The next steps I think would be to make a branch starting at https://github.com/apache/datafusion/pull/17888 and implement the functions above, and start writing some tests, etc

alamb • Oct 04 '25 09:10

Hi hi, I'm going to start working on the UDFs. I'm happy to upstream this as a separate repo

friendlymatthew • Oct 07 '25 12:10

Outstanding -- thanks @friendlymatthew that is great news!

Would you like a separate repo in datafusion-contrib perhaps? We could potentially also reuse the https://github.com/datafusion-contrib/datafusion-functions-variant repo

Also, FYI I think the WIP upgrade to arrow 57 in https://github.com/apache/datafusion/pull/17888 is now ready for use as a base (it passes all tests)

While details of the variant implementation may change some more, the actual kernels (variant_get, cast_to_variant, etc) have all been pretty stable

alamb • Oct 07 '25 15:10

> Outstanding -- thanks @friendlymatthew that is great news!
>
> Would you like a separate repo in datafusion-contrib perhaps? We could potentially also reuse the https://github.com/datafusion-contrib/datafusion-functions-variant repo

Out of respect to the work done in functions-variant, could we make a new repo? I already started work in https://github.com/friendlymatthew/datafusion-variant, I'm happy to transfer this over

friendlymatthew • Oct 07 '25 15:10

It's transferred! 🛳️

https://github.com/datafusion-contrib/datafusion-variant

alamb • Oct 07 '25 16:10

Hi @alamb, I saw this epic, and I have wanted to start contributing to DataFusion for some time. Can I help you or @friendlymatthew with anything in this epic?

carpecodeum • Dec 15 '25 00:12

Hi @carpecodeum -- I suggest looking at the issues in https://github.com/datafusion-contrib/datafusion-variant -- we are working on the integration there

alamb • Dec 16 '25 14:12