flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-34111][table] Add support for json_quote, json_unquote, address PR feedback #24156

Open anupamaggarwal opened this issue 1 year ago • 2 comments

What is the purpose of the change

Adds implementation of json_quote and json_unquote functions. PR is a continuation of https://github.com/apache/flink/pull/24156 and addresses feedback received on the original PR.

Brief change log / changes compared to PR 24156

  • Update implementation for json_quote with escaping logic
  • Update implementation for json_unquote to unescape special characters
    • Delegates valid json checking to SqlJsonUtils#isJsonValue
  • Add/ update some additional tests
  • Updates documentation to remove reference to MySql utf8mb4 encoding

Open questions

  • High level behavior for invalid json in json_unquote (pass original input as is, Vs throwing exception)
  • To check for JSON_VALIDITY for json_unquote logic for IS JSON is used but in certain cases IS JSON behaviour does not seem to be consistent with Json spec as documented here https://www.json.org/json-en.html
    • According to json spec """" should not be a valid json (which is also flagged by JsonLint) but select '""""' IS JSON returns true
      • In case of invalid json string for json_unquote an exception is thrown in case of Mysql (However we won't do the same)

References

  • postgres implementation for escape

Verifying this change

  • Added ITs in JsonFunctionsITCase

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs)

anupamaggarwal avatar Jun 20 '24 12:06 anupamaggarwal

CI report:

  • 570d28feff96fdbc19606fd89814429fa2ab0513 Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Jun 20 '24 12:06 flinkbot

@flinkbot run azure

anupamaggarwal avatar Jun 20 '24 15:06 anupamaggarwal

Thanks for the contribution @anupamaggarwal I checked for most cases it works well

I found one issue with symbols encoded outside of BMP

e.g.

SELECT '≠';

This works fine now if we try it with newly added functions

SELECT json_quote('≠');

it fails as

Caused by: java.util.IllegalFormatConversionException: x != java.lang.Character
	at java.base/java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4426)
	at java.base/java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2938)
	at java.base/java.util.Formatter$FormatSpecifier.print(Formatter.java:2892)
	at java.base/java.util.Formatter.format(Formatter.java:2673)
	at java.base/java.util.Formatter.format(Formatter.java:2609)
	at java.base/java.lang.String.format(String.java:2897)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.formatStr(JsonQuoteFunction.java:77)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.quote(JsonQuoteFunction.java:68)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.eval(JsonQuoteFunction.java:88)
	at StreamExecCalc$3.processElement(Unknown Source)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
	at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
	at ...

I tend to think that here

	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.formatStr(JsonQuoteFunction.java:77)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.quote(JsonQuoteFunction.java:68)
	at org.apache.flink.table.runtime.functions.scalar.JsonQuoteFunction.eval(JsonQuoteFunction.java:88)

need to work with codepoints rather than chars

At the same time query

SELECT json_unquote('≠');

works fine.

I also checked same queries for MySQL and they work perfectly fine

snuyanzin avatar Jul 04 '24 07:07 snuyanzin

thanks for your review @snuyanzin ! Fixed in 4267018

anupamaggarwal avatar Jul 05 '24 10:07 anupamaggarwal

thanks @snuyanzin for your review! I think both these cases are related to inconsistent behavior (IIUC) of IS JSON function which is used in json_unquote implementation for checking JSON validity. I had observed something similar earlier (see Open questions section of the PR description.

SELECT JSON_UNQUOTE('"1""1"');

As you pointed out this should ideally throw an exception (as mysql), however IS JSON returns true in this case (IIUC this should be a bug? since 1""1 is not a valid json)

select '"1""1"' is json; //true 

If we fix this by throwing an exception (preferable), it could be in-consistent with what IS JSON considers as valid json? Should we care about compatibility with IS JSON behavior here (or just fix this and file a JIRA for IS JSON?)

Same is the case with

SELECT JSON_UNQUOTE('"1""\uzzzz"')
select '"1""\uzzzz"' is JSON; //true

However in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal

For

SELECT JSON_UNQUOTE('"1\uzzzz"')

IS JSON returns FALSE and we return the input string back per documentation.

SELECT JSON_UNQUOTE('""\ufffa"'); it starts printing backslash as well which is NOT OK

Hmm, I tested this through the sql-client.sh, and I get " which seems expected? (as ""\ufffa" is a valid JSON )

anupamaggarwal avatar Jul 10 '24 02:07 anupamaggarwal

Thanks for the explanation

However in this case we throw an exception (from json_unquote) since \uzzzz is not a valid literal

that's the main point, since in related docs of this PR is written [1]

If the value starts and ends with double quotes but is not a valid JSON string literal, the value is passed through unmodified.

So based on the doc instead of exception I would expect same value as it was in input

[1] https://github.com/apache/flink/pull/24967/files#diff-539fb22ee6aeee4cf07230bb4155500c6680c4cc889260e2c58bfa9d63fb7de5R385

snuyanzin avatar Jul 10 '24 06:07 snuyanzin

One more diff between MySQL and Flink behavior, however this I'm not sure need to fix since at least from my point of view MySQL's behavior is questionable MySQL

SELECT  JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"');
SELECT  JSON_UNQUOTE(JSON_UNQUOTE('"\\u0022\\u005c\\u005c\\u0075\\u0030\\u0030\\u0061\\u0061\\u0022"'));

returns

"\\u00aa"
\u00aa  

Flink

SELECT  JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"');
SELECT  JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));

returns

"\\u00aa"
ª 

snuyanzin avatar Jul 11 '24 23:07 snuyanzin

thanks @snuyanzin, fixed in 7546d9c

Hmm, that's strange, when I tested

SELECT JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'); SELECT JSON_UNQUOTE(JSON_UNQUOTE('"\u0022\u005c\u005c\u0075\u0030\u0030\u0061\u0061\u0022"'));

through sql-client I get the same result as MySql

I didn't entirely understand why this would be unexpected though For example, if I test (using sql-client) with

select json_unquote('"\u00aa"'); I get ª whereas select json_unquote('"\\u00aa"'); returns \u00aa

which seems ok to me, unless I missed something?

Added some tests in ScalarFunctionsTest

anupamaggarwal avatar Jul 17 '24 06:07 anupamaggarwal

yes, probably there is something wrong with my mysql... thanks for checking and thanks for addressing feedback

snuyanzin avatar Aug 07 '24 07:08 snuyanzin

@anupamaggarwal can you please rebase to the latest master to be sure that it still passes ci ? (I tend to think that current failure is not related to the changes in PR)

snuyanzin avatar Aug 07 '24 07:08 snuyanzin

@flinkbot run azure

anupamaggarwal avatar Aug 07 '24 17:08 anupamaggarwal

Thanks @snuyanzin I pulled the latest from master on my branch and the build is passing now (It seems CI is a bit flaky since one of the runs post merge failed but the second went through). The failure didn't seem related to this change though.

anupamaggarwal avatar Aug 08 '24 03:08 anupamaggarwal

Cool, thanks for your contribution @anupamaggarwal thanks for the review @fhueske

if there is no objections I'm going to merge later today or tomorrow

snuyanzin avatar Aug 08 '24 10:08 snuyanzin

Thank you both @anupamaggarwal and @snuyanzin!

fhueske avatar Aug 09 '24 07:08 fhueske