envoy icon indicating copy to clipboard operation
envoy copied to clipboard

json: new exception free sanitize method

Open wbpcode opened this issue 1 year ago • 6 comments

Commit Message: json: new exception free sanitize method Additional Description:

part of #35545. Because the Json sanitizing is necessary at workers threads, even we can compile the new Formatter implementation out from E-M, we still need a exception free version of sanitize() method.

This PR provide one by suppressing exception of nlohmann/json which only bring only ~10 lines new code. It may not best solution but is the solution with highest RIO.

Risk Level: low. Testing: fuzz, benchmark. Docs Changes: n/a. Release Notes: n/a. Platform Specific Features: n/a.

wbpcode avatar Aug 29 '24 02:08 wbpcode

cc @htuch

wbpcode avatar Aug 29 '24 02:08 wbpcode

Interesting. I'm rusty on nlohmann -- what's the functional difference? You are goign to add unit tests for the new way of calling the sanitizer, right?

The behavior of nlohmann self is completely same exception we will suppress the exceptions with a parameter in new sanitizeExceptionFree().

But the final behavior of sanitize() and sanitizeExceptionFree is a little different. The sanitize() will serialize all utf8 to format \xxx if the input contain invalid uft8. The sanitizeExceptionFree() will only replace invalid utf8 to U+FFFD and the correct utf8 will kept as expected.

I think we have discussed this in the #35598.

We should call out in tests any differences between the exception and exception-free options.

Of course.

wbpcode avatar Aug 29 '24 02:08 wbpcode

So if I have two different same-length strings that are completey invalid utf-8 then this impl will alias them together?

jmarantz avatar Aug 29 '24 02:08 jmarantz

So if I have two different same-length strings that are completey invalid utf-8 then this impl will alias them together?

Yeah. You cannot expect anything from invalid utf-8.

wbpcode avatar Aug 29 '24 03:08 wbpcode

Yes I can :)

I can expect that the serializer generates a byte sequence that's legal json, that deserializes into the original bytes. That's what the current one does.

However, I think in this case it's OK to document that the exception free one has a known limitation and will just drop invalid sequnces and replace them with something common.

jmarantz avatar Aug 29 '24 03:08 jmarantz

Yes I can :)

I can expect that the serializer generates a byte sequence that's legal json, that deserializes into the original bytes. That's what the current one does.

However, I think in this case it's OK to document that the exception free one has a known limitation and will just drop invalid sequnces and replace them with something common.

Maybe we can expect the nlohmann provide more error handler in future.

wbpcode avatar Aug 29 '24 03:08 wbpcode

I don't expect anything, but this is a regression to current behavior. I said it's OK we just have to decide we will lose info in that case, and document it at the API and in the doc for json.

jmarantz avatar Aug 29 '24 20:08 jmarantz

/wait

jmarantz avatar Aug 29 '24 20:08 jmarantz

/retest

wbpcode avatar Aug 30 '24 12:08 wbpcode

I just found the json_loader use similar way to suppress exceptions. 😆

https://github.com/envoyproxy/envoy/blob/main/source%2Fcommon%2Fjson%2Fjson_internal.cc#L571-L576

wbpcode avatar Aug 30 '24 12:08 wbpcode

/retest

wbpcode avatar Sep 01 '24 12:09 wbpcode

FWIW, /wait-any label is more appropriate than /wait at the cases where no explicit change is required.

wbpcode avatar Sep 01 '24 14:09 wbpcode

I have a new idea. See source/common/protobuf/yaml_utility.cc, function utf8CoerceToStructurallyValid.

Based on the logic in there we can write a scanner to see if utf8 is structurally valid. If it's not, just escape everything to unicode. If it is, then run the nlohmann sanitizer with the fffd replacement. The tests as is should pass.

Note that function actually does its own destructive replacement and so we need to make a parallel funciton that is just a predicate.

We will of course only run this logic when needSlowSanitize() is true.

WDYT?

jmarantz avatar Sep 01 '24 21:09 jmarantz

I have a new idea. See source/common/protobuf/yaml_utility.cc, function utf8CoerceToStructurallyValid.

Based on the logic in there we can write a scanner to see if utf8 is structurally valid. If it's not, just escape everything to unicode. If it is, then run the nlohmann sanitizer with the fffd replacement. The tests as is should pass.

Note that function actually does its own destructive replacement and so we need to make a parallel funciton that is just a predicate.

We will of course only run this logic when needSlowSanitize() is true.

WDYT?

Let me check it. I am okay to do that if we have exist logic 🙂

wbpcode avatar Sep 02 '24 00:09 wbpcode

I see, the IsStructurallyValid of protobuf could be used.

wbpcode avatar Sep 02 '24 01:09 wbpcode

Yes I think so. I didn't see it has part of the protobuf library, but I do see it in ./external/com_google_protobuf/third_party/utf8_range/_virtual_includes/utf8_validity/utf8_validity.h. In source/common/protobuf/yaml_utility.cc it just includes "utf8_validity.h" so I think you can copy the dependency from the bazel build rule yaml_utility.cc. I don't see this as part of protobufs, but maybe protobufs also depends on it.

jmarantz avatar Sep 02 '24 13:09 jmarantz

Use the https://github.com/envoyproxy/envoy/pull/35936

wbpcode avatar Sep 02 '24 13:09 wbpcode