seafowl icon indicating copy to clipboard operation
seafowl copied to clipboard

JSON support

Open mildbyte opened this issue 2 years ago • 1 comments

Support for querying / writing JSON values, similar to SQLite's support (https://www.sqlite.org/json1.html) and PG's support (https://www.postgresql.org/docs/14/functions-json.html).

Ideally should be built as a separate crate (e.g. datafusion-json / seafowl-datafusion-json) that defines DataFusion UDFs/UDAFs to manipulate JSON data, but if it isn't implementable as that, we can keep it in Seafowl or contribute as a patch to DataFusion.

Minimum useful set of functions

  • json_extract (SQLite-style) / json_extract_path (PG-style): get a value out of a JSON value
  • json_set: replace/add a value inside of a JSON value
  • json_group_array+json_group_object (SQLite-style) / json_agg+json_object_agg (PG-style): aggregation function that turns a table into a JSON object/array
  • json_object/json_array: create an inline JSON object/array from the function arguments

Investigate how to "tag" a JSON value using Arrow's type system

SQLite tries to get away with treating JSON values as strings, but sometimes we do need to distinguish between a string and a string that is actually a JSON object.

There's a difference between

json_set('{"a":2,"c":4}', '$.a', '{"b": 42}')

and

json_set('{"a":2,"c":4}', '$.a',  json('{"b": 42}'))

In the first case, we're setting the value of a to be the string {"b": 42}. In the second case, we're setting it to a JSON object. We might need to define an equivalent function / Arrow datatype (even if we ultimately store JSON values as strings in the Parquet file, we need to occasionally distinguish between strings and strings that are JSON values in our type system).

We could use the Arrow struct type (https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/datatype.rs#L169), though it requires the types of the values to be predefined. We could instead make the struct have a single field, json: Utf8, that Seafowl will know to treat as JSON.

Some inspiration from Snowflake? https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html

Table-valued functions (set-returning functions)

SQLite JSON functions like json_each/json_tree and PG JSON functions like json_each/json_array_elements (that return SETOF) actually return a table that we can then join on.

User-defined table functions are currently unsupported in DataFusion (https://github.com/apache/arrow-datafusion/issues/1604 / https://github.com/apache/arrow-datafusion/issues/1608): we could contribute support for them and then build JSON support on top of that, or implement it using DataFusion's custom logical plan / optimizer rules if possible.

mildbyte avatar Oct 03 '22 10:10 mildbyte