kaml icon indicating copy to clipboard operation
kaml copied to clipboard

Wrong encoding used for stream methods

Open Vampire opened this issue 2 years ago • 4 comments

Describe the bug

The current code uses for example InputStreamReader(source). This uses the JVM default encoding. From Java 18 on, this will usually be UTF-8 but can also be reconfigured. Before Java 18, this is system dependent and often is not UTF-8. And even if it were always UTF-8, Yaml parsers have to support UTF-8 and UTF-16 with and without BOM for 1.1 and also UTF-32 for 1.2.

From a quick look the writing stream methods always use UTF-8, for those it might be handy to be able to supply the wanted encoding.

But more important, for the reading stream methods you need a detection algorithm with a push-back stream, so that you can detect BOM and act accordingly and if none found, fall back to UTF-8.

You could also consider adding methods that take a Reader / Writer instead as there the encoding is already given by the caller.

Reproduction repo

No response

Steps to reproduce

  • have a YAML with ö in it with UTF-8 encoding
  • set the jvm default encoding to ISO-8859-1
  • deserialize the YAML and you get ö instead of the ö

Expected behaviour

You get ö

Actual behaviour

You get ö

Version information

0.46.0

Any other information

No response

Vampire avatar Jul 12 '22 00:07 Vampire

Thanks for the bug report @Vampire. I'll take a look.

charleskorn avatar Jul 12 '22 07:07 charleskorn

You could also consider adding methods that take a Reader / Writer instead as there the encoding is already given by the caller.

Unfortunately, the underlying YAML library kaml uses (SnakeYAML) doesn't support using Writer instances, so I've avoided adding support for either Reader or Writer to maintain the symmetry of the API.

It would be pretty straightforward for the existing encodeToStream and decodeFromStream methods to take a Charset parameter, and default to UTF-8 - would that work for you?

charleskorn avatar Jul 12 '22 07:07 charleskorn

Well, it works at least better, even if it is not fully spec compliant. :-) Maybe SnakeYAML should fix it for the reading then though.

Vampire avatar Jul 12 '22 17:07 Vampire

Might be worth raising a bug with SnakeYAML directly then.

Is there still value in adding a Charset parameter to encodeToStream and decodeFromStream? Or did you have another solution in mind?

charleskorn avatar Jul 13 '22 07:07 charleskorn

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will automatically be closed if no further activity occurs in the next seven days to enable maintainers to focus on the most important issues. If this issue is still affecting you, please comment below within the next seven days. Thank you for your contributions.

stale[bot] avatar Sep 11 '22 07:09 stale[bot]

This issue has been automatically closed because it has not had any recent activity.

stale[bot] avatar Sep 18 '22 08:09 stale[bot]