asciidoctorj

Encoding issue with rendered HTML

Open jkost opened this issue 8 years ago • 5 comments

Hello,

We have a strange issue when converting .adoc to HTML. I use the following .adoc file, which contains some Greek characters (but I believe the issue exists with other non-English characters, e.g. French, German, etc.).

= Sample Document

== Section one

This is content of section one

== Section two

And content of section two. Ελληνικά;

The produced output is attached (asciidoc.zip) and is rendered like so:

[screenshot: screen shot 2017-01-14 at 12 08 36]

After I comment out the following line (already commented out in the attached asciidoc.zip):

<!--<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Open+Sans:300,300italic,400,400italic,600,600italic%7CNoto+Serif:400,400italic,700,700italic%7CDroid+Sans+Mono:400,700">-->

both English and Greek characters are rendered correctly (as they are if you open the .html file in any browser). I use a MacBook with Mac OS X 10.9.5 and, obviously, I use Greek.

On another person's machine, though, who doesn't use Greek, the sample adoc is not rendered as complete garbage. It is still bad, but only the Greek characters are garbage. If he comments out the line with the stylesheet in the output HTML, it does not help, because those characters really are garbage in the HTML itself.

Our guess is that Asciidoctor.convert somewhere uses the platform default encoding on the passed input string (converting the string to bytes doesn't seem useful, but it might do so because of some internal API). The conversion is done in line 303 of:

https://github.com/asciidoctor/asciidoctorj/blob/master/asciidoctorj-core/src/main/java/org/asciidoctor/internal/JRubyAsciidoctor.java

Object object = this.asciidoctorModule.convert(content, rubyHash);

where asciidoctorModule is created from asciidoctorclass.rb. I don't know where exactly, but shouldn't there be a check of the content's encoding somewhere before the conversion is applied?
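To make the guess about platform default encoding concrete, here is a minimal Java sketch of how UTF-8 text turns into garbage when its bytes are decoded with a different charset. This is not AsciidoctorJ code; the class name `MojibakeDemo` and the method `roundTrip` are made up for illustration, and ISO-8859-1 merely stands in for whatever non-UTF-8 default a machine might have.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of AsciidoctorJ.
class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with the given charset. */
    public static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        // The Greek word from the sample document, written with Unicode escapes
        // so that the encoding of this source file does not matter.
        String greek = "\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC";

        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(greek, StandardCharsets.UTF_8));

        // Single-byte charset on the decoding side: every Greek letter
        // (two bytes in UTF-8) becomes two unrelated characters.
        System.out.println(roundTrip(greek, StandardCharsets.ISO_8859_1));

        // This is the charset used by String.getBytes() and new String(byte[])
        // when no charset argument is passed.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

If something in the pipeline does an implicit byte conversion like this, the result would depend on each machine's default charset, which would be consistent with the two machines behaving differently.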

jkost avatar Jan 15 '17 18:01 jkost

Hi, I am sorry to say I could not reproduce the issue on Windows 8.1 (Spanish version) or MacOS 10.11.6 (US English). I could see the attached HTML properly both with and without the commented line. In my test, I used the Maven plugin to render a file with the content you show; I just copy-pasted it from the site into IntelliJ.

Note that Asciidoctor enforces content to UTF-8, that doesn't mean that it converts the characters, more like it ignores the original encoding and assumes UTF-8. To be sure we are not missing anything, can you answer these:

  • Can you check if the source file is UTF-8? If you're not sure, just attach the original file in a zip to the issue.
  • Can you explain how you are rendering the file?
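For the first question, one way to check a file is to run its bytes through a strict UTF-8 decoder. The following is a minimal sketch using only the JDK; the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration. Note that a pure-ASCII file also passes, so a positive result only shows that the bytes are well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Asciidoctor or AsciidoctorJ.
class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting the replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the .adoc file to check.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

Text saved in a legacy single-byte Greek encoding will normally fail this check, because its high bytes do not form valid UTF-8 multi-byte sequences.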

abelsromero avatar Jan 15 '17 23:01 abelsromero

Hello abelsromero,

Actually, this bug occurs with the plugin AsciidoctorJ4NB. The content of the asciidoc.html file was pasted during debugging from the variable html in AdocVisualPanel of AsciidoctorJ4NB:

String html = AsciidoctorConverter.getDefault().convert(asciidocText, getInitialOptions());

I hereby attach a sample NetBeans project which illustrates the problem, at least on my machine. How does it render on yours?

WebViewTest.zip

Thanks.

jkost avatar Jan 16 '17 18:01 jkost

I can see the HTML in the project correctly. But when I run the project, all I get is a white window. I debugged it, and the HTML is loaded into the string, but nothing is shown.

abelsromero avatar Jan 17 '17 10:01 abelsromero

Hello abelsromero,

Actually, this bug occurs with the plugin AsciidoctorJ4NB. The content of the asciidoc.html file was pasted during debugging from the variable html in AdocVisualPanel of AsciidoctorJ4NB:

String html = AsciidoctorConverter.getDefault().convert(asciidocText, getInitialOptions());

I hereby attach a sample NetBeans project which illustrates the problem, at least on my machine. How does it render on yours?

WebViewTest.zip

Thanks.

jkost avatar Jan 18 '17 19:01 jkost

Note that Asciidoctor enforces content to UTF-8, that doesn't mean that it converts the characters, more like it ignores the original encoding and assumes UTF-8.

That's absolutely correct. The source document (whether it is a file or a string) has to be encoded in UTF-8 (or UTF-16 with a BOM). Otherwise, you'll get mojibake.
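For callers that read the source bytes themselves, the rule above can be sketched on the Java side as follows. This is only an illustration of BOM-based decoding before the text is handed to a converter; it is not how Asciidoctor implements it, and the class name `BomAwareDecoder` and the method `decode` are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Asciidoctor or AsciidoctorJ.
class BomAwareDecoder {

    /**
     * Decodes bytes as UTF-16 if they start with a UTF-16 BOM,
     * otherwise as UTF-8 (skipping a UTF-8 BOM if present).
     */
    public static String decode(byte[] b) {
        // UTF-8 BOM: EF BB BF
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return new String(b, 3, b.length - 3, StandardCharsets.UTF_8);
        }
        // UTF-16 big-endian BOM: FE FF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return new String(b, 2, b.length - 2, StandardCharsets.UTF_16BE);
        }
        // UTF-16 little-endian BOM: FF FE
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return new String(b, 2, b.length - 2, StandardCharsets.UTF_16LE);
        }
        // No BOM: assume UTF-8, never the platform default.
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

The important part is the last line: without a BOM the bytes are treated as UTF-8 explicitly, so the result does not depend on the machine's default charset.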

mojavelinux avatar Oct 15 '18 08:10 mojavelinux