ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: `to_json_string` (alternatively, `to_json`, but that could get confusing with the JSON type)

Open tswast opened this issue 1 year ago • 4 comments

Is your feature request related to a problem?

Some DB engines provide a TO_JSON_STRING or TO_JSON method to get a JSON serialization of an arbitrary value.

What is the motivation behind your request?

As seen in https://github.com/ibis-project/ibis/pull/9470, it can be useful to get a string representation of types, where cast to string is either ambiguous or not supported.

BigQuery DataFrames uses TO_JSON_STRING for the same reason (fallback for types that don't support cast to string) as well as for interop with extensions such as passing rows to Remote Functions.

Describe the solution you'd like

Value.to_json_string() would make sense to me. I would avoid Value.to_json(), as I would expect that to return the JSON type in engines that support it.

What version of ibis are you running?

8.x, working on 9.x upgrade

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

tswast avatar Jul 09 '24 16:07 tswast

Seems like a good thing to add that wouldn't be particularly contentious, especially given it's already supported by multiple backends!

deepyaman avatar Jul 12 '24 18:07 deepyaman

Following #9788, I think we might be better served by an as_* prefix here, so as_json_string().

jcrist avatar Aug 07 '24 17:08 jcrist

Looking more into this, IIUC semantically this should be the same as:

t.some_col.cast("json").cast("string")

If that's correct, I wonder if we could just stick with that spelling as a user-facing API. For backends that implement an optimized function (and may not handle this in their optimizer themselves?) we could always use a simple rewrite rule to compile to a specific one-call version like bigquery's TO_JSON_STRING.

jcrist avatar Aug 14 '24 19:08 jcrist

BigQuery doesn't support casting to JSON directly.

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT CAST(some_numbers AS JSON) FROM Sequences

fails with Invalid cast from ARRAY<INT64> to JSON at [7:13]

But, I can do this:

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT TO_JSON(some_numbers) FROM Sequences

Similarly, I can't cast from JSON to STRING.

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT CAST(TO_JSON(some_numbers) AS STRING) FROM Sequences

fails with Invalid cast from JSON to STRING at [7:13].

So, the SQL for t.some_col.cast("json").cast("string") without any rewrites would be

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT TO_JSON_STRING(TO_JSON(some_numbers)) FROM Sequences

Comparing the previous SQL with one that omits TO_JSON, BQ uses about 1/2 the "slots" with this SQL (eyeballing based on about 5 query runs each):

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT TO_JSON_STRING(some_numbers) FROM Sequences

I suspect the reason for the difference is that the JSON type in BQ is optimized for flexible schemas and lookups, so it is a more expensive type to build than just going straight to a JSON string.

That's all to say, a specific optimization for t.some_col.cast("json").cast("string") seems like it could be worthwhile.

tswast avatar Sep 13 '24 16:09 tswast

I am a fan of .cast("json").cast("string") being the user facing API, and not introducing something new just for this.

NickCrews avatar Jan 17 '25 04:01 NickCrews