Error "incompatible character encodings: UTF-8 and ASCII-8BIT" when combined with a rails app
I think this might not be a commonmarker problem, BUT the error is not raised when using pandoc-ruby nor redcarpet, so it has something to do with commonmarker.
Here you can see a test run from the command line with both cmark and commonmarker and there's no problem:
$ cat test-curly-quotes.md
This curly quote “makes commonmarker throw an exception”.
$ cmark --version
cmark 0.20.0 - CommonMark converter
(C) 2014, 2015 John MacFarlane
$ cmark test-curly-quotes.md
<p>This curly quote “makes commonmarker throw an exception”.</p>
$ gem list --local commonmarker
*** LOCAL GEMS ***
commonmarker (0.2.0)
$ cat test-curly-quotes.md | ruby -r commonmarker -e "puts CommonMarker.render_html(gets)"
<p>This curly quote “makes commonmarker throw an exception”.</p>
That said, I'm testing different markdown parsers/renderers for our rails 4.1.12 (ruby 2.2.2) based app and I'm getting the following error:
ActionView::Template::Error (incompatible character encodings: UTF-8 and ASCII-8BIT):
12: - if user_signed_in?
13: .outline-content
14: = commonmarker_markdown(@quimbee_outline.source)
app/views/outlines/show.html.slim:15:in `_app_views_outlines_show_html_slim___3317075370232322437_70158621096300'
Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/_trace.html.erb (2.9ms)
Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/_request_and_response.html.erb (1.7ms)
Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/template_error.html.erb within rescues/layout (69.1ms)
I have these helpers:
# encoding: UTF-8
module ApplicationHelper
  def commonmarker_markdown(text)
    CommonMarker.render_html(text, :smart).html_safe
  end
  def pandoc_markdown(text)
    converter = PandocRuby.new(text, from: :markdown, to: :html)
    converter.convert.html_safe
  end
  def redcarpet_markdown(text)
    # ...
  end
end
Changing the call to commonmarker_markdown to either pandoc_markdown or redcarpet_markdown renders the expected result with no errors.
It's not a DB (postgresql) encoding problem either as hardcoding the test phrase in place of the text variable (no DB involved) causes the same problem.
Any ideas about what could be happening?
Amazing write up, thank you. I'll take a look at this within the day. There might need to be a forced UTF-8 encoding.
I have bad news and good news.
The bad news is, I cannot get the exception to throw. I started a new rails project, jumped into console, and tried to see what would happen if I passed the same data:
irb(main):007:0> require 'commonmarker'
=> false
irb(main):008:0> CommonMarker::VERSION
=> "0.2.0"
irb(main):009:0> c = "This curly quote “makes commonmarker throw an exception”."
=> "This curly quote “makes commonmarker throw an exception”."
irb(main):010:0> CommonMarker.render_html(c, :smart).html_safe
=> "<p>This curly quote \xE2\x80\x9Cmakes commonmarker throw an exception\xE2\x80\x9D.</p>\n"
The "good" news is that there's definitely something weird going on with those escape codes. I would expect “...” to come back. It does worry me that I can't reproduce the exception, though.
I wonder if this is specific to ActionView::Template. The "quick" answer would be to append .force_encoding('UTF-8'):
irb(main):011:0> CommonMarker.render_html(c, :smart).force_encoding('utf-8')
=> "<p>This curly quote “makes commonmarker throw an exception”.</p>\n"
But that seems wrong/unfair/not the responsibility of the consumer.
To finish my thought: probably this library should do the force_encoding. Could you verify that force_encoding fixes the problem for you? If so I'll do a patch release for this.
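For anyone following along, here is a minimal sketch of why force_encoding helps. It assumes the rendered HTML comes back tagged as ASCII-8BIT while the surrounding template is UTF-8; the strings below are stand-ins built with plain Ruby, not actual commonmarker output.

```ruby
# Stand-in for rendered HTML: valid UTF-8 bytes, but tagged as ASCII-8BIT
# (binary). This only simulates the situation described in this thread.
html = "<p>\u201Cquoted\u201D</p>".b

# Stand-in for template output: a UTF-8 string with a non-ASCII character.
template = "caf\u00E9: "

# Joining two non-ASCII strings with different encodings raises.
error_class = nil
begin
  template + html
rescue Encoding::CompatibilityError => e
  error_class = e.class
end

# force_encoding only changes the encoding tag; the bytes stay the same.
# dup keeps the original string untouched.
fixed = html.dup.force_encoding(Encoding::UTF_8)
combined = template + fixed

puts error_class
puts combined
```

Because only the tag changes, this is safe only when the bytes really are UTF-8; valid_encoding? can confirm that after retagging.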
You nailed it! It works. Thanks.
BTW... I'm using slim. Maybe it's related?
@gjtorikian we're seeing this same issue. Is there a way to traverse all nodes and convert each to utf8? Any pointers you can provide would be greatly appreciated!
Which version of commonmarker are you using?
@gjtorikian We're on version 0.23.4
So you can absolutely walk the AST tree: https://github.com/gjtorikian/commonmarker#example-walking-the-ast
But that's very slow/time-consuming, and ideally shouldn't be necessary. Are you able to share your markdown doc or create a small (failing) test to show the error?
@gjtorikian thank you for your response. I'm trying to paste a minimal example but it appears Github's editor is stripping out the problematic character from the following:
s = "hello: <https://world.com>"
doc = CommonMarker.render_doc(s, :DEFAULT)
parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.include?(text_node.string_content)
      puts(node.url)
    end
  end
end
You may need to insert the missing 0x200b character locally so as to achieve:
We solved this problem with:
s = "hello: <https://world.com>"
doc = CommonMarker.render_doc(s, :DEFAULT)
parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.force_encoding("UTF-8").include?(text_node.string_content.force_encoding("UTF-8"))
      puts(node.url)
    end
  end
end
but it would be great if commonmarker gave us an option to treat the whole document's tree as utf-8, so we don't need to force all encodings. Would that be feasible?
Yes, it should be. I agree that forcing the encoding is not ideal!
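Until such an option exists, one way to keep the workaround in a single place is a small helper. This is only a sketch: as_utf8 is a hypothetical name, not part of commonmarker, and the url and label strings below are plain-Ruby stand-ins for node.url and text_node.string_content.

```ruby
# Hypothetical helper (not a commonmarker API): return a UTF-8-tagged copy
# of the string without mutating the argument.
def as_utf8(str)
  return str if str.encoding == Encoding::UTF_8
  str.dup.force_encoding(Encoding::UTF_8)
end

# Stand-ins: the URL is tagged ASCII-8BIT, the link text is UTF-8, and both
# contain the zero-width space U+200B.
url = "https://world.com\u200b".b
label = "https://world.com\u200b"

# Comparing them directly raises because the encodings are incompatible.
raised = begin
  url.include?(label)
  false
rescue Encoding::CompatibilityError
  true
end

# Retagging both sides first makes the comparison work.
match = as_utf8(url).include?(as_utf8(label))

puts raised
puts match
```

Unlike calling force_encoding directly on node.url, the helper works on a copy, so the original string keeps its encoding tag.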
@duhaime Hm. One thing that's different here is that when I run your code, with the encoded character placed, my tree doesn't recognize any link nodes at all:
#<CommonMarker::Node(document): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">]>]>
#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">]>
#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">
Could you change your sample code to
doc.walk do |node|
  puts node
  # ...
end
And list the walked nodes as I've done here?
Hmm, the plot thickens!
I get:
=> "hello: <https://world.com>"
=> #<CommonMarker::Node(document): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>29} children=[#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>29} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>7}, string_content="hello: ">, #<CommonMarker::Node(link): sourcepos={:start_line=>1, :start_column=>8, :end_line=>1, :end_column=>29}, url="https://world.com\xE2\x80\x8B", title="" children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>9, :end_line=>1, :end_column=>28}, string_content="https://world.com">]>]>]>
=> ""
#<CommonMarker::Node:0x000000010e422980>
#<CommonMarker::Node:0x000000010e4daf30>
#<CommonMarker::Node:0x000000010e6ae730>
#<CommonMarker::Node:0x000000010e6ae140>
Traceback (most recent call last):
3: from (irb):31
2: from (irb):36:in `block in irb_binding'
1: from (irb):36:in `include?'
Encoding::CompatibilityError (incompatible character encodings: ASCII-8BIT and UTF-8)
And:
> CommonMarker::VERSION
=> "0.23.4"
Why would these results look so different? I'm using the Rails console instead of irb above--is that relevant?
Ah, I think I misread your example. The string is literally <https://world.com0x200b>, not <https://world.com<0x200b>>. Is that right?
Oh no, sorry, it should be exactly as it appears in the image above (the latter in your comment above).
Strange!
What version of Ruby do you have running?
2.6.8 via rbenv:
(base) % ruby -v
ruby 2.6.8p205 (2021-07-07 revision 67951) [arm64-darwin21]
(base) % which ruby
/Users/doug/.rbenv/shims/ruby
I simply can't reproduce this, and neither can CI, which runs Ruby 2.6.6 on Windows, Ubuntu, and macOS. I booted a Rails 7 app to test the logic in the console, and it worked fine, too.
Just to be extra explicit, this is the code I'm using to test:
str = "hello: <https://world.com<0x200b>>"
doc = CommonMarker.render_doc(str, :DEFAULT)
doc.walk do |node|
  puts node.type
end
A couple of things to note:
- GitHub's editor isn't stripping out that character, so I'm not sure why it is for you
- This still doesn't detect a link, and causes no "incompatible character encoding" errors.
I'm afraid without more information I'm not sure what I can do to solve this.
Ah I think your example just needs to be updated. Your snippet has:
str = "hello: <https://world.com<0x200b>>"
In this case, your string literally contains the characters in the Unicode character that's causing the issue. I think we just need to update the string you're using. As it turns out, the string I posted initially (s = "hello: <https://world.com>") does contain the character--you should see it if you paste it in your Rails console:
You can see the codepoints of the string if you use s.unpack('U*') [and can combine the codepoints back into a string like so: s.unpack('U*').pack("U*")].
Does this help you reproduce the situation?
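The codepoint check mentioned above can be sketched like this. To keep the invisible character from being lost in transit, the sketch writes it as a \u200b escape, which is an assumption about what the original string contains.

```ruby
# The zero-width space is written as an escape so it survives copy/paste.
s = "hello: <https://world.com\u200b>"

# unpack("U*") returns the Unicode code points as integers.
codepoints = s.unpack("U*")

# 0x200B is the zero-width space.
has_zwsp = codepoints.include?(0x200B)

# pack("U*") turns the code points back into a UTF-8 string.
round_tripped = codepoints.pack("U*")

puts has_zwsp
puts round_tripped == s
```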
Got it. In Ruby, the convention is to use a \u escape to write a Unicode code point in hexadecimal:
irb(main):006:0> str = "hello: <https://world.com\u200b>"
=> "hello: <https://world.com >"
I can now reproduce the problem; now we're getting somewhere.
Oh, and what's the code snippet you're using to render the string? CommonMarker.render_doc(str, :DEFAULT).to_html ?
Yes, or CommonMarker.render_doc(str, :DEFAULT).to_plaintext
@duhaime Can you try pointing the gem to the encodaroni branch? I believe this will have the fix, and if so, I will push out a new bug release.
Hmm, the change looks good but I'm still getting the same error. This must be user error. Here's what I'm doing:
gem uninstall commonmarker
git clone https://github.com/gjtorikian/commonmarker
cd commonmarker && gem build commonmarker.gemspec
gem install commonmarker-0.23.4.gem
irb
Then in the irb console:
require 'commonmarker'
s = "hello: <https://world.com>"
doc = CommonMarker.render_doc(s, :DEFAULT)
parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.include?(text_node.string_content)
      puts(node.url)
    end
  end
end
Which throws:
Traceback (most recent call last):
13: from /Users/doug/.rbenv/versions/2.6.8/bin/irb:23:in `<main>'
12: from /Users/doug/.rbenv/versions/2.6.8/bin/irb:23:in `load'
11: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
10: from (irb):7
9: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:17:in `walk'
8: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:72:in `each'
7: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:18:in `block in walk'
6: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:17:in `walk'
5: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:72:in `each'
4: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:18:in `block in walk'
3: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:16:in `walk'
2: from (irb):11:in `block in irb_binding'
1: from (irb):11:in `include?'
Encoding::CompatibilityError (incompatible character encodings: ASCII-8BIT and UTF-8)
Should I be doing something differently to test?
With the repo cloned, try:
- script/bootstrap
- bundle exec rake clean compile test
Interesting, I ran those steps on a fresh rbenv env, and I still get the same result. Do you get a different result with the code block I posted above?
Oh shoot, I do. Ok. I'll make time for this today.
Due to https://github.com/gjtorikian/commonmarker/pull/186, walking over nodes has been removed in v1.0.0. Users can use https://github.com/gjtorikian/html-pipeline if they wish to iterate over HTML after the fact.