seafowl
seafowl copied to clipboard
JSON support
Support for querying / writing JSON values, similar to SQLite's support (https://www.sqlite.org/json1.html) and PG's support (https://www.postgresql.org/docs/14/functions-json.html).
Ideally should be built as a separate crate (e.g. datafusion-json / seafowl-datafusion-json) that defines DataFusion UDFs/UDAFs to manipulate JSON data, but if it isn't implementable as that, we can keep it in Seafowl or contribute as a patch to DataFusion.
Minimum useful set of functions
-
json_extract
(SQLite-style) /json_extract_path
(PG-style): get a value out of a JSON value -
json_set
: replace/add a value inside of a JSON value -
json_group_array
+json_group_object
(SQLite-style) /json_agg
+json_object_agg
(PG-style): aggregation function that turns a table into a JSON object/array -
json_object
/json_array
: create an inline JSON object/array from the function arguments
Investigate how to "tag" a JSON value using Arrow's type system
SQLite tries to get away with treating JSON values as strings, but sometimes we do need to distinguish between a string and a string that is actually a JSON object.
There's a difference between
json_set('{"a":2,"c":4}', '$.a', '{"b": 42}')
and
json_set('{"a":2,"c":4}', '$.a', json('{"b": 42}'))
In the first case, we're setting the value of a
to be the string {"b": 42}
. In the second case, we're setting it to a JSON object. We might need to define an equivalent function / Arrow datatype (even if we ultimately store JSON values as strings in the Parquet file, we need to occasionally distinguish between strings and strings that are JSON values in our type system).
We could use the Arrow struct type (https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/datatype.rs#L169), though it requires the types of the values to be predefined. We could instead make the struct have a single field, json: Utf8
, that Seafowl will know to treat as JSON.
Some inspiration from Snowflake? https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html
Table-valued functions (set-returning functions)
SQLite JSON functions like json_each
/json_tree
and PG JSON functions like json_each
/json_array_elements
(that return SETOF
) actually return a table that we can then join on.
User-defined table functions are currently unsupported in DataFusion (https://github.com/apache/arrow-datafusion/issues/1604 / https://github.com/apache/arrow-datafusion/issues/1608): we could contribute support for them and then build JSON support on top of that, or implement it using DataFusion's custom logical plan / optimizer rules if possible.