ibis
ibis copied to clipboard
feat: JSON data types
Hi, I was wondering what would be the best way to treat JSON/JSONB columns from Postgres in Ibis. Currently I'm trying to treat it as a string (although it gets converted to a dict whenever I send the data to a pandas dataframe), but I'm not sure if that's the right solution.
Good question! What are you trying to do with the data? I haven't thought through how we could incorporate schema-less JSON into Ibis's data model. For nested data in Hive, Impala, Spark SQL, etc., it's a little more clear what should be done.
FWIW, Presto has a JSON type stored as a string. The main purpose it serves is to indicate that the underlying string is valid JSON. For example,
select json '[a"]'
fails because it can't be parsed as valid json, whereas
select json '["a"]'
is converted to json.
Additionally Presto uses a subset of JSONPath to let users pick out parts of the JSON blob that they want.
An ibis-worthy API might look something like this:
t.blob_of_json.extract('$.foo.bar')
t.blob_of_json.extract('$.foo.bar', type='int64')
t.blob_of_json.extract('$.foo').cast('array<int64>')
We could also leverage jsonpath-rw which is a Python DSL that generates JSONPath expressions.
Should we limit the scope here? E.g., arbitrary JSON is out of scope, but JSON with a well-defined, uniform schema per record (including arbitrary nesting) is in scope.
The Apache Drill data model page is an excellent read and IMO a nice indication of reasonable expectations to set for JSON functionality.
@wesm thoughts?
Makes sense to me. I think one of the challenges will be supporting nested data generally given the diverging semantics of different SQL engines (for example: comparing Hive, Presto, and Impala).
For the record it looks like each of the systems mentioned by @wesm have three different ways of unnesting for arrays (similar for maps):
Given a table t:
CREATE TABLE t (
elements ARRAY<STRING>
)
Hive:
SELECT element
FROM t
LATERAL VIEW explode(elements) s AS element;
Presto:
SELECT s.element
FROM t
CROSS JOIN unnest(elements) AS s (element)
Impala:
SELECT elements.item
FROM t, t.elements
I think we can come up with a reasonable API for doing these operations.
Cool thanks for looking into this. Might also be good to look at BigQuery for another data point. From the Ibis perspective, it might make the most sense to model this explicitly as a join. Expression analysis/validation may well become more complex, though.
BigQuery:
SELECT element
FROM t.elements
or
SELECT element
FROM UNNEST (t.elements)
The UNNEST is optional according to the docs
I'll make a new issue for the array API. I've started work on it by way of getting the type system in place and also added an "array_agg" function (called collect), indexing/slicing, and length computation. More details in the issue.
Just upvoting that JSON/JSONB support would be a nice feature.
We have JSON and JSONB types now, and with 3e2efb433 we have a minimal API for extracting object values or array elements. The remaining work is implement feature requests for additional APIs, so I'm going to close this out as completed.