feat: `to_json_string` (alternatively, `to_json`, but that could get confusing with the JSON type)
Is your feature request related to a problem?
Some DB engines provide a TO_JSON_STRING or TO_JSON method to get a JSON serialization of an arbitrary value.
What is the motivation behind your request?
As seen in https://github.com/ibis-project/ibis/pull/9470, it can be useful to get a string representation of types, where cast to string is either ambiguous or not supported.
BigQuery DataFrames uses TO_JSON_STRING for the same reason (fallback for types that don't support cast to string) as well as for interop with extensions such as passing rows to Remote Functions.
Describe the solution you'd like
Value.to_json_string() would make sense to me. I would avoid Value.to_json(), as I would expect that to return the JSON type in engines that support it.
What version of ibis are you running?
8.x, working on 9.x upgrade
What backend(s) are you using, if any?
BigQuery
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Seems like a good thing to add that wouldn't be particularly contentious, especially given it's already supported by multiple backends!
Following #9788, I think we might be better served by an as_* prefix here, so as_json_string().
Looking more into this, IIUC semantically this should be the same as:
t.some_col.cast("json").cast("string")
If that's correct, I wonder if we could just stick with that spelling as a user-facing API. For backends that implement an optimized function (and may not handle this in their optimizer themselves?) we could always use a simple rewrite rule to compile to a specific one-call version like bigquery's TO_JSON_STRING.
BigQuery doesn't support casting to JSON directly.
WITH
Sequences AS (
SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
SELECT [2, 4, 8, 16, 32] UNION ALL
SELECT [5, 10]
)
SELECT CAST(some_numbers AS JSON) FROM Sequences
fails with Invalid cast from ARRAY<INT64> to JSON at [7:13]
But, I can do this:
WITH
Sequences AS (
SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
SELECT [2, 4, 8, 16, 32] UNION ALL
SELECT [5, 10]
)
SELECT TO_JSON(some_numbers) FROM Sequences
Similarly, I can't cast from JSON to STRING.
WITH
Sequences AS (
SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
SELECT [2, 4, 8, 16, 32] UNION ALL
SELECT [5, 10]
)
SELECT CAST(TO_JSON(some_numbers) AS STRING) FROM Sequences
fails with Invalid cast from JSON to STRING at [7:13].
So, the SQL for t.some_col.cast("json").cast("string") without any rewrites would be
WITH
Sequences AS (
SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
SELECT [2, 4, 8, 16, 32] UNION ALL
SELECT [5, 10]
)
SELECT TO_JSON_STRING(TO_JSON(some_numbers)) FROM Sequences
Comparing the previous SQL with one that omits TO_JSON, BQ uses about 1/2 the "slots" with this SQL (eyeballing based on about 5 query runs each):
WITH
Sequences AS (
SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
SELECT [2, 4, 8, 16, 32] UNION ALL
SELECT [5, 10]
)
SELECT TO_JSON_STRING(some_numbers) FROM Sequences
I suspect the reason for the difference is that the JSON type in BQ is optimized for flexible schemas and lookups, so it is a more expensive type to build than just going straight to a JSON string.
That's all to say, a specific optimization for t.some_col.cast("json").cast("string") seems like it could be worthwhile.
I am a fan of .cast("json").cast("string") being the user facing API, and not introducing something new just for this.