[FLINK-34111][table] Add support for json_quote, json_unquote, address PR feedback #24156
What is the purpose of the change
Adds implementation of json_quote and json_unquote functions.
PR is a continuation of https://github.com/apache/flink/pull/24156 and addresses feedback received on the original PR.
Brief change log / changes compared to PR 24156
- Update implementation for
json_quotewith escaping logic - Update implementation for
json_unquoteto unescape special characters- Delegates valid json checking to SqlJsonUtils#isJsonValue
- Add/ update some additional tests
- Updates documentation to remove reference to MySql utf8mb4 encoding
Open questions
- High level behavior for invalid json in json_unquote (pass original input as is, Vs throwing exception)
- To check for JSON_VALIDITY for
json_unquotelogic forIS JSONis used but in certain casesIS JSONbehaviour does not seem to be consistent with Json spec as documented here https://www.json.org/json-en.html- According to json spec
""""should not be a valid json (which is also flagged by JsonLint) butselect '""""' IS JSONreturns true- In case of invalid json string for json_unquote an exception is thrown in case of Mysql (However we won't do the same)
- According to json spec
References
- postgres implementation for escape
Verifying this change
- Added ITs in JsonFunctionsITCase
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
@Public(Evolving): (no) - The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
- The S3 file system connector: (no)
Documentation
- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented? (docs)
CI report:
- 570d28feff96fdbc19606fd89814429fa2ab0513 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:@flinkbot run azurere-run the last Azure build
@flinkbot run azure
Thanks for the contribution @anupamaggarwal I checked for most cases it works well
I found one issue with symbols encoded outside of BMP
e.g.
SELECT '≠';
This works fine now if we try it with newly added functions
SELECT json_quote('≠');
it fails as
Caused by: java.util.IllegalFormatConversionException: x != java.lang.Character
at java.base/java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4426)
at java.base/java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2938)
at java.base/java.util.Formatter$FormatSpecifier.print(Formatter.java:2892)
at java.base/java.util.Formatter.format(Formatter.java:2673)
at java.base/java.util.Formatter.format(Formatter.java:2609)
at java.base/java.lang.String.format(String.java:2897)
at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.formatStr(JsonQuoteFunction.java:77)
at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.quote(JsonQuoteFunction.java:68)
at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.eval(JsonQuoteFunction.java:88)
at StreamExecCalc$3.processElement(Unknown Source)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
at ...
I tend to think that here
at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.formatStr(JsonQuoteFunction.java:77)
at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.quote(JsonQuoteFunction.java:68)
at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.eval(JsonQuoteFunction.java:88)
need to work with codepoints rather than chars
At the same time query
SELECT json_unquote('≠');
works fine.
I also checked same queries for MySQL and they work perfectly fine
thanks for your review @snuyanzin ! Fixed in 4267018
thanks @snuyanzin for your review!
I think both these cases are related to inconsistent behavior (IIUC) of IS JSON function which is used in json_unquote implementation for checking JSON validity. I had observed something similar earlier (see Open questions section of the PR description.
SELECT JSON_UNQUOTE('"1""1"');
As you pointed out this should ideally throw an exception (as mysql), however IS JSON returns true in this case (IIUC this should be a bug? since 1""1 is not a valid json)
select '"1""1"' is json; //true
If we fix this by throwing an exception (preferable), it could be in-consistent with what IS JSON considers as valid json? Should we care about compatibility with IS JSON behavior here (or just fix this and file a JIRA for IS JSON?)
Same is the case with
SELECT JSON_UNQUOTE('"1""\uzzzz"')
select '"1""\uzzzz"' is JSON; //true
However in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal
For
SELECT JSON_UNQUOTE('"1\uzzzz"')
IS JSON returns FALSE and we return the input string back per documentation.
SELECT JSON_UNQUOTE('""\ufffa"'); it starts printing backslash as well which is NOT OK
Hmm, I tested this through the sql-client.sh, and I get " which seems expected? (as ""\ufffa" is a valid JSON )
Thanks for the explanation
However in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal
that's the main point, since in related docs of this PR is written [1]
If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified.
So based on the doc instead of exception I would expect same value as it was in input
[1] https://github.com/apache/flink/pull/24967/files#diff-539fb22ee6aeee4cf07230bb4155500c6680c4cc889260e2c58bfa9d63fb7de5R385
One more diff between MySQL and Flink behavior, however this I'm not sure need to fix since at least from my point of view MySQL's behavior is questionable MySQL
SELECT JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"'));
returns
"\\u00aa"
\u00aa
Flink
SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));
returns
"\\u00aa"
ª
thanks @snuyanzin, fixed in 7546d9c
Hmm, that's strange, when I tested
SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'); SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));
through sql-client I get the same result as MySql
I didn't entirely understand why this would be unexpected though For example, if I test (using sql-client) with
select json_unquote('"\u00aa"'); I get ª
whereas
select json_unquote('"\\u00aa"'); returns \u00aa
which seems ok to me, unless I missed something?
Added some tests in ScalarFunctionsTest
yes, probably there is something wrong with my mysql... thanks for checking and thanks for addressing feedback
@anupamaggarwal can you please rebase to the latest master to be sure that it still passes ci ? (I tend to think that current failure is not related to the changes in PR)
@flinkbot run azure
Thanks @snuyanzin I pulled the latest from master on my branch and the build is passing now (It seems CI is a bit flaky since one of the runs post merge failed but the second went through). The failure didn't seem related to this change though.
Cool, thanks for your contribution @anupamaggarwal thanks for the review @fhueske
if there is no objections I'm going to merge later today or tomorrow
Thank you both @anupamaggarwal and @snuyanzin!