ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: JSON data types

Open andrioni opened this issue 9 years ago • 11 comments

Hi, I was wondering what would be the best way to treat JSON/JSONB columns from Postgres in Ibis. Currently I'm trying to treat it as a string (although it gets converted to a dict whenever I send the data to a pandas dataframe), but I'm not sure if that's the right solution.

andrioni avatar Jan 20 '16 13:01 andrioni

Good question! What are you trying to do with the data? I haven't thought through how we could incorporate schema-less JSON into Ibis's data model. For nested data in Hive, Impala, Spark SQL, etc., it's a little more clear what should be done.

wesm avatar Feb 05 '16 22:02 wesm

FWIW, Presto has a JSON type stored as a string. The main purpose it serves is to indicate that the underlying string is valid JSON. For example,

select json '[a"]'

fails because it can't be parsed as valid json, whereas

select json '["a"]'

is converted to json.

cpcloud avatar May 04 '16 17:05 cpcloud

Additionally Presto uses a subset of JSONPath to let users pick out parts of the JSON blob that they want.

An ibis-worthy API might look something like this:

t.blob_of_json.extract('$.foo.bar')
t.blob_of_json.extract('$.foo.bar', type='int64')
t.blob_of_json.extract('$.foo').cast('array<int64>')

cpcloud avatar May 15 '16 17:05 cpcloud

We could also leverage jsonpath-rw which is a Python DSL that generates JSONPath expressions.

cpcloud avatar May 15 '16 17:05 cpcloud

Should we limit the scope here? E.g., arbitrary JSON is out of scope, but JSON with a well-defined, uniform schema per record (including arbitrary nesting) is in scope.

The Apache Drill data model page is an excellent read and IMO a nice indication of reasonable expectations to set for JSON functionality.

@wesm thoughts?

cpcloud avatar May 25 '16 04:05 cpcloud

Makes sense to me. I think one of the challenges will be supporting nested data generally given the diverging semantics of different SQL engines (for example: comparing Hive, Presto, and Impala).

wesm avatar May 26 '16 00:05 wesm

For the record it looks like each of the systems mentioned by @wesm have three different ways of unnesting for arrays (similar for maps):

Given a table t:

CREATE TABLE t (
  elements ARRAY<STRING>
)

Hive:

SELECT element
FROM t
LATERAL VIEW explode(elements) s AS element;

Presto:

SELECT s.element
FROM t
CROSS JOIN unnest(elements) AS s (element)

Impala:

SELECT elements.item
FROM t, t.elements

I think we can come up with a reasonable API for doing these operations.

cpcloud avatar Jun 15 '16 00:06 cpcloud

Cool thanks for looking into this. Might also be good to look at BigQuery for another data point. From the Ibis perspective, it might make the most sense to model this explicitly as a join. Expression analysis/validation may well become more complex, though.

wesm avatar Jun 15 '16 21:06 wesm

BigQuery:

SELECT element
FROM t.elements

or

SELECT element
FROM UNNEST (t.elements)

The UNNEST is optional according to the docs

cpcloud avatar Jun 16 '16 01:06 cpcloud

I'll make a new issue for the array API. I've started work on it by way of getting the type system in place and also added an "array_agg" function (called collect), indexing/slicing, and length computation. More details in the issue.

cpcloud avatar Jun 16 '16 01:06 cpcloud

Just upvoting that JSON/JSONB support would be a nice feature.

twiecki avatar Nov 27 '18 13:11 twiecki

We have JSON and JSONB types now, and with 3e2efb433 we have a minimal API for extracting object values or array elements. The remaining work is implement feature requests for additional APIs, so I'm going to close this out as completed.

cpcloud avatar Sep 21 '22 12:09 cpcloud