YamlParsing failure if various unicode characters exist in the source file.

Open traceyyoshima opened this issue 2 years ago • 1 comments

Example:

root:
  - value1: 🛠
    value2: check

Exception:

com.fasterxml.jackson.databind.JsonMappingException: Invalid surrogate pair, starts with invalid high surrogate (0xDEE0), not in valid range [0xD800, 0xDBFF] (through reference chain: java.util.ArrayList[6]->org.openrewrite.yaml.tree.Yaml$Documents["documents"]->java.util.ArrayList[0]->org.openrewrite.yaml.tree.Yaml$Document["block"]->org.openrewrite.yaml.tree.Yaml$Mapping["entries"]->java.util.ArrayList[0]->org.openrewrite.yaml.tree.Yaml$Mapping$Entry["value"]->org.openrewrite.yaml.tree.Yaml$Mapping["entries"]->java.util.ArrayList[1]->org.openrewrite.yaml.tree.Yaml$Mapping$Entry["value"]->org.openrewrite.yaml.tree.Yaml$Sequence["entries"]->java.util.ArrayList[0]->org.openrewrite.yaml.tree.Yaml$Sequence$Entry["block"]->org.openrewrite.yaml.tree.Yaml$Mapping["entries"]->java.util.ArrayList[1]->org.openrewrite.yaml.tree.Yaml$Mapping$Entry["prefix"])
> Invalid surrogate pair, starts with invalid high surrogate (0xDEE0), not in valid range [0xD800, 0xDBFF] (through reference chain: java.util.ArrayList[6]->org.openrewrite.yaml.tree.Yaml$Documents["documents"]->java.util.ArrayList[0]->org.openrewrite.yaml.tree.Yaml$Document["block"]->org.openrewrite.yaml.tree.Yaml$Mapping["entries"]->java.util.ArrayList[0]->org.openrewrite.yaml.tree.Yaml$Mapping$Entry["value"]->org.openrewrite.yaml.tree.Yaml$Mapping["entries"]->java.util.ArrayList[1]->org.openrewrite.yaml.tree.Yaml$Mapping$Entry["value"]->org.openrewrite.yaml.tree.Yaml$Sequence["entries"]->java.util.ArrayList[0]->org.openrewrite.yaml.tree.Yaml$Sequence$Entry["block"]->org.openrewrite.yaml.tree.Yaml$Mapping["entries"]->java.util.ArrayList[1]->org.openrewrite.yaml.tree.Yaml$Mapping$Entry["prefix"])
  > Invalid surrogate pair, starts with invalid high surrogate (0xDEE0), not in valid range [0xD800, 0xDBFF]

Based on the stack trace, the encoding of the character may be unsupported. Character encodings for 🛠 (U+1F6E0):

  • UTF-8 encoding: 0xF0 0x9F 0x9B 0xA0
  • UTF-16 encoding: 0xD83D 0xDEE0
  • UTF-32 encoding: 0x0001F6E0
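
The values above can be reproduced with the JDK alone. The following is a minimal sketch, not code from the rewrite project: the class name `SurrogateDemo` and its helper methods are hypothetical. It also illustrates one plausible reading of the exception, namely that a string beginning with the low surrogate 0xDEE0 (as would happen if the pair were split between two fields) is what the serializer rejects.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of OpenRewrite.
public class SurrogateDemo {

    // Formats the UTF-8 bytes of a string as space-separated hex values.
    static String utf8Bytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Formats the UTF-16 code units (Java chars) of a string as hex values.
    static String utf16Units(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%04X", (int) c));
        }
        return sb.toString();
    }

    // True if the string begins with a low surrogate that has no preceding
    // high surrogate, i.e. the surrogate pair was cut in half.
    static boolean startsWithLoneLowSurrogate(String s) {
        return !s.isEmpty() && Character.isLowSurrogate(s.charAt(0));
    }

    public static void main(String[] args) {
        String tool = new String(Character.toChars(0x1F6E0)); // the 🛠 character

        System.out.println(utf8Bytes(tool));   // 0xF0 0x9F 0x9B 0xA0
        System.out.println(utf16Units(tool));  // 0xD83D 0xDEE0

        // Splitting between the two code units leaves 0xDEE0 on its own.
        String tail = tool.substring(1);
        System.out.println(startsWithLoneLowSurrogate(tool)); // false
        System.out.println(startsWithLoneLowSurrogate(tail)); // true
    }
}
```

If the parser slices the source text by `char` index rather than by code point, a split like the `substring(1)` call above could occur; whether that is what happens here is an assumption.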

traceyyoshima, Jul 19 '22 23:07

  • The parsing issue prevents ingesting Micronaut projects.
  • WINDOWS-1252 and ISO-8859-1 are not supported by the YAML parser:
    • The InputStreamReader is not passed a charset (for example from StandardCharsets) and always defaults to UTF-8.
    • The String built from the ByteArrayInputStream is always decoded as UTF-8.
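
To illustrate the charset point above, here is a minimal sketch of reading bytes through an `InputStreamReader` with an explicit `Charset`. The class `CharsetReadDemo` and its `read` method are hypothetical and only stand in for whatever the YAML parser does internally. It shows that ISO-8859-1 bytes decode correctly when the charset is supplied and are damaged when UTF-8 is forced.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of OpenRewrite.
public class CharsetReadDemo {

    // Decodes the bytes with the given charset instead of a fixed default.
    static String read(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "name: café" encoded as ISO-8859-1; é is the single byte 0xE9.
        byte[] latin1 = "name: caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoded with the matching charset, the text round-trips.
        System.out.println(read(latin1, StandardCharsets.ISO_8859_1));

        // Decoded as UTF-8, the byte 0xE9 is malformed input and is
        // replaced with U+FFFD, so the original character is lost.
        System.out.println(read(latin1, StandardCharsets.UTF_8).contains("\uFFFD"));
    }
}
```

Passing the charset as a parameter is one possible design; detecting the encoding from the file is another, and which approach suits the parser is left open here.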

traceyyoshima, Jul 20 '22 20:07

YAML files with Unicode characters are now skipped as of https://github.com/openrewrite/rewrite/pull/3427

Exploring options for a fix in: https://github.com/openrewrite/rewrite/pull/3421

timtebeek, Jul 20 '23 14:07