Wrong encoding for Java source files with ISO-8859-1
Hi,
we have a massive problem with source file encoding. A lot of our projects are encoded in ISO-8859-1. It seems that the encoding is not taken from the build configuration (e.g. a Maven property) but detected by org.openrewrite.internal.EncodingDetectingInputStream. This leads to incorrect file changes. See this example project:
https://github.com/thomaszub/rewrite-encoding-bug
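To illustrate the kind of damage a wrong charset guess can cause, here is a minimal sketch using only the JDK. It is not taken from the demo project or from OpenRewrite; the class name `EncodingMismatchDemo` and its helper are hypothetical. It decodes ISO-8859-1 bytes as UTF-8, which turns the non-ASCII characters into replacement characters (U+FFFD).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenRewrite.
public class EncodingMismatchDemo {

    // Decodes raw bytes with the given charset. Malformed input is
    // silently replaced with U+FFFD, as new String(byte[], Charset) does.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Gruesse" written with u-umlaut and sharp s, as Unicode escapes
        // so this file itself is encoding-independent.
        String original = "Gr\u00FC\u00DFe";

        // The bytes as they would be stored in an ISO-8859-1 source file.
        byte[] onDisk = original.getBytes(StandardCharsets.ISO_8859_1);

        String correct = decode(onDisk, StandardCharsets.ISO_8859_1);
        String misread = decode(onDisk, StandardCharsets.UTF_8);

        System.out.println("Read as ISO-8859-1 matches original: " + correct.equals(original));
        System.out.println("Read as UTF-8 matches original: " + misread.equals(original));
        System.out.println("Misread text contains U+FFFD: " + misread.contains("\uFFFD"));
    }
}
```

Once the text has been misread this way, writing the file back out cannot restore the original characters, whichever charset is used for writing.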
I would suggest deriving the encoding from the build system, e.g. Maven's project.build.sourceEncoding, and only falling back to heuristics if no encoding can be derived.
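A rough sketch of what I mean, using only the JDK. The class `SourceCharsetResolver` and its methods are hypothetical names for illustration, and the fallback heuristic (strict UTF-8 validation, else ISO-8859-1) is my own assumption, not necessarily what EncodingDetectingInputStream does. The `configuredEncoding` argument stands for whatever the build system provides, such as the value of project.build.sourceEncoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of OpenRewrite.
public class SourceCharsetResolver {

    // Prefers the encoding configured in the build; only guesses
    // from the file content when nothing is configured.
    static Charset resolve(String configuredEncoding, byte[] content) {
        if (configuredEncoding != null && !configuredEncoding.trim().isEmpty()) {
            return Charset.forName(configuredEncoding.trim());
        }
        // Assumed fallback heuristic: valid UTF-8 wins, otherwise ISO-8859-1.
        return isValidUtf8(content) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    // Returns true if the bytes decode as UTF-8 without any malformed input.
    static boolean isValidUtf8(byte[] content) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xC3 0xA4 is valid UTF-8, but also two valid ISO-8859-1 characters,
        // so content alone cannot tell which one the author meant.
        byte[] ambiguous = {(byte) 0xC3, (byte) 0xA4};
        System.out.println(resolve("ISO-8859-1", ambiguous)); // configured value is used
        System.out.println(resolve(null, ambiguous));         // heuristic decides
    }
}
```

The point of the ordering is that byte sequences like the one in `main` are valid in both encodings, so a content-based guess can be wrong even when the build configuration states the encoding explicitly.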
Thanks and kind regards Thomas
Hi @thomaszub,
Thanks for the demo project; I can appreciate the scale of this problem. Looking at the docs, we may need to consider both sources and resources.
Hi @pway99,
I made a PR which extends Parser, Parser.Input and EncodingDetectingInputStream with the possibility to set a Charset. This works with the maven-plugin if it is changed accordingly (I can make a PR for the maven-plugin). As I'm not familiar enough with the gradle-plugin or the non-Java parsers, I would currently consider this PR incomplete. But maybe it helps with fixing the problem.
Hi Thomas,
Thanks for the PR. Check out JavaParser.Builder#charset; it might simplify things a bit. I'm still unsure how to handle the EncodingDetectingInputStream when the charset is intentionally set, and I will be looking at this as well. Perhaps we can team up on this one.
Hi Thomas, I have put up a rewrite-maven PR for setting the charset from Maven's sourceEncoding property, and I am working on a rewrite-java solution that will use the maven-plugin charset when it's specified.
Hi Patrick, if I remember correctly, this is not enough, as the JavaParser will not pass the encoding to the Input and EncodingDetectingInputStream. Maybe we should team up and look at the code together.
Fixed by #2249