Error "incompatible character encodings: UTF-8 and ASCII-8BIT" when combined with a rails app

Open oboxodo opened this issue 10 years ago • 27 comments

I think this might not be a commonmarker problem, BUT the error is not raised when using pandoc-ruby nor redcarpet, so it has something to do with commonmarker.

Here you can see a test run from the command line with both cmark and commonmarker and there's no problem:

$ cat test-curly-quotes.md
This curly quote “makes commonmarker throw an exception”.

$ cmark --version
cmark 0.20.0 - CommonMark converter
(C) 2014, 2015 John MacFarlane

$ cmark test-curly-quotes.md
<p>This curly quote “makes commonmarker throw an exception”.</p>

$ gem list --local commonmarker

*** LOCAL GEMS ***

commonmarker (0.2.0)

$ cat test-curly-quotes.md | ruby -r commonmarker -e "puts CommonMarker.render_html(gets)"
<p>This curly quote “makes commonmarker throw an exception”.</p>

That said, I'm testing different markdown parsers/renderers for our Rails 4.1.12 (Ruby 2.2.2) app and I'm getting the following error:

ActionView::Template::Error (incompatible character encodings: UTF-8 and ASCII-8BIT):
    12:       - if user_signed_in?
    13:         .outline-content
    14:           = commonmarker_markdown(@quimbee_outline.source)
  app/views/outlines/show.html.slim:15:in `_app_views_outlines_show_html_slim___3317075370232322437_70158621096300'


  Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/_trace.html.erb (2.9ms)
  Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/_request_and_response.html.erb (1.7ms)
  Rendered /Users/oboxodo/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/templates/rescues/template_error.html.erb within rescues/layout (69.1ms)

I have these helpers:

# encoding: UTF-8
module ApplicationHelper
  def commonmarker_markdown(text)
    CommonMarker.render_html(text, :smart).html_safe
  end

  def pandoc_markdown(text)
    converter = PandocRuby.new(text, from: :markdown, to: :html)
    converter.convert.html_safe
  end

  def redcarpet_markdown(text)
    # ...
  end
end

Changing the call to commonmarker_markdown to either pandoc_markdown or redcarpet_markdown renders the expected result with no errors.

It's not a DB (postgresql) encoding problem either as hardcoding the test phrase in place of the text variable (no DB involved) causes the same problem.

Any ideas about what could be happening?

oboxodo avatar Jul 07 '15 00:07 oboxodo

Amazing write up, thank you. I'll take a look at this within the day. There might need to be a forced UTF-8 encoding.

gjtorikian avatar Jul 07 '15 02:07 gjtorikian

I have bad news and good news.

The bad news is, I cannot reproduce the exception. I started a new Rails project, jumped into the console, and tried to see what would happen if I passed the same data:

irb(main):007:0> require 'commonmarker'
=> false
irb(main):008:0> CommonMarker::VERSION
=> "0.2.0"
irb(main):009:0> c = "This curly quote “makes commonmarker throw an exception”."
=> "This curly quote “makes commonmarker throw an exception”."
irb(main):010:0> CommonMarker.render_html(c, :smart).html_safe
=> "<p>This curly quote \xE2\x80\x9Cmakes commonmarker throw an exception\xE2\x80\x9D.</p>\n"

The "good" news is that there's definitely something weird going on with those escape codes. I would expect “...” to come back. It does worry me that I can't reproduce the exception, though.
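A minimal sketch of why the same text can show up as \xE2\x80\x9C escapes: the bytes are identical, only the encoding tag differs. It uses plain Ruby strings rather than commonmarker output, and tag_and_inspect is a hypothetical helper name.

```ruby
# Hypothetical helper: report a string's encoding tag and how #inspect renders it.
def tag_and_inspect(str)
  [str.encoding, str.inspect]
end

utf8   = "\u201Cquoted\u201D" # curly quotes, tagged UTF-8
binary = utf8.b               # copy with the same bytes, tagged ASCII-8BIT (binary)

p tag_and_inspect(utf8)
p tag_and_inspect(binary)     # inspect shows \xE2\x80\x9C ... \xE2\x80\x9D escapes

puts utf8.bytes == binary.bytes                      # true: no byte was changed
puts binary.force_encoding(Encoding::UTF_8) == utf8  # true: retagging restores equality
```

force_encoding only changes the tag and never transcodes, which is consistent with the curly quotes coming back once the output is retagged as UTF-8.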

I wonder if this is specific to ActionView::Template. The "quick" answer would be to append .force_encoding('UTF-8'):

irb(main):011:0> CommonMarker.render_html(c, :smart).force_encoding('utf-8')
=> "<p>This curly quote “makes commonmarker throw an exception”.</p>\n"

But that seems wrong/unfair/not the responsibility of the consumer.
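For illustration, here is a sketch of how the exception can arise without Rails at all. It assumes the view's output buffer already holds non-ASCII UTF-8 text; append_to_page is a hypothetical stand-in for that buffer, not an ActionView API.

```ruby
# Hypothetical stand-in for a template engine appending a fragment to its output buffer.
def append_to_page(fragment)
  buffer = +"caf\u00E9: " # UTF-8 buffer that already contains a non-ASCII character
  buffer << fragment
end

html = "<p>\u201Chi\u201D</p>".b # stand-in for renderer output tagged ASCII-8BIT

begin
  append_to_page(html)
rescue Encoding::CompatibilityError => e
  # Both sides hold non-ASCII bytes under different tags, so Ruby refuses to join them.
  puts "raised: #{e.class}"
end

# Retagging a copy as UTF-8 lets the same bytes be appended.
fixed = append_to_page(html.dup.force_encoding(Encoding::UTF_8))
puts fixed.encoding
```

Whether this is exactly what happens inside ActionView::Template is an assumption; the sketch only shows that the error needs non-ASCII bytes on both sides of the concatenation.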

gjtorikian avatar Jul 07 '15 03:07 gjtorikian

> But that seems wrong/unfair/not the responsibility of the consumer.

To finish my thought: probably this library should do the force_encoding. Could you verify that force_encoding fixes the problem for you? If so I'll do a patch release for this.

gjtorikian avatar Jul 07 '15 03:07 gjtorikian

You nailed it! It works. Thanks.

oboxodo avatar Jul 07 '15 04:07 oboxodo

BTW... I'm using slim. Maybe it's related?

oboxodo avatar Jul 07 '15 05:07 oboxodo

@gjtorikian we're seeing this same issue. Is there a way to traverse all nodes and convert each to utf8? Any pointers you can provide would be greatly appreciated!

duhaime avatar Mar 10 '22 21:03 duhaime

Which version of commonmarker are you using?

gjtorikian avatar Mar 11 '22 13:03 gjtorikian

@gjtorikian We're on version 0.23.4

duhaime avatar Mar 12 '22 13:03 duhaime

So you can absolutely walk the AST: https://github.com/gjtorikian/commonmarker#example-walking-the-ast

But that's very slow/time-consuming, and ideally shouldn't be necessary. Are you able to share your markdown doc or create a small (failing) test to show the error?

gjtorikian avatar Mar 13 '22 22:03 gjtorikian

@gjtorikian thank you for your response. I'm trying to paste a minimal example but it appears GitHub's editor is stripping out the problematic character from the following:

s = "hello: <https://world.com​>"
doc = CommonMarker.render_doc(s, :DEFAULT)

parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.include?(text_node.string_content)
      puts(node.url)
    end
  end
end

You may need to insert the missing 0x200b character locally so as to achieve:

[Image: Screen Shot 2022-03-13 at 8 50 47 PM]

We solved this problem with:

s = "hello: <https://world.com​>"
doc = CommonMarker.render_doc(s, :DEFAULT)

parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.force_encoding("UTF-8").include?(text_node.string_content.force_encoding("UTF-8"))
      puts(node.url)
    end
  end
end

but it would be great if commonmarker gave us an option to treat the whole document's tree as utf-8, so we don't need to force all encodings. Would that be feasible?
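Until such an option exists, a small wrapper can keep the retagging in one place. This is only a sketch: as_utf8 is a hypothetical helper, not part of commonmarker, and the two strings stand in for node.url and text_node.string_content.

```ruby
# Hypothetical helper: return a UTF-8-tagged copy without mutating the argument,
# and fail loudly if the bytes are not valid UTF-8.
def as_utf8(str)
  copy = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid UTF-8: #{str.inspect}" unless copy.valid_encoding?
  copy
end

url  = "https://world.com\u200B".b # stand-in for a url tagged ASCII-8BIT
text = "https://world.com\u200B"   # stand-in for string_content tagged UTF-8

puts as_utf8(url).include?(as_utf8(text)) # true
puts url.encoding == Encoding::ASCII_8BIT # true: the original string is left untouched
```

Unlike calling force_encoding directly on the value, which retags the receiver in place, this works on a copy and checks valid_encoding? so that genuinely binary data is not silently relabelled.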

duhaime avatar Mar 14 '22 00:03 duhaime

Yes, it should be. I agree that forcing the encoding is not ideal!

gjtorikian avatar Mar 14 '22 21:03 gjtorikian

@duhaime Hm. One thing that's different here is that when I run your code with the encoded character in place, my tree doesn't recognize any link nodes at all:

#<CommonMarker::Node(document): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">]>]>
#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">]>
#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>34}, string_content="hello: <https://world.com<0x200b>>">

Could you change your sample code to

doc.walk do |node|
  puts node
  # ...
end

And list the walked nodes as I've done here?

gjtorikian avatar Mar 14 '22 21:03 gjtorikian

Hmm, the plot thickens!

I get:

=> "hello: <https://world.com​>"
=> #<CommonMarker::Node(document): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>29} children=[#<CommonMarker::Node(paragraph): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>29} children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>1, :end_line=>1, :end_column=>7}, string_content="hello: ">, #<CommonMarker::Node(link): sourcepos={:start_line=>1, :start_column=>8, :end_line=>1, :end_column=>29}, url="https://world.com\xE2\x80\x8B", title="" children=[#<CommonMarker::Node(text): sourcepos={:start_line=>1, :start_column=>9, :end_line=>1, :end_column=>28}, string_content="https://world.com​">]>]>]>

=> ""
#<CommonMarker::Node:0x000000010e422980>
#<CommonMarker::Node:0x000000010e4daf30>
#<CommonMarker::Node:0x000000010e6ae730>
#<CommonMarker::Node:0x000000010e6ae140>
Traceback (most recent call last):
        3: from (irb):31
        2: from (irb):36:in `block in irb_binding'
        1: from (irb):36:in `include?'
Encoding::CompatibilityError (incompatible character encodings: ASCII-8BIT and UTF-8)

And:

> CommonMarker::VERSION
=> "0.23.4"

Why would these results look so different? I'm using the Rails console instead of irb above--is that relevant?

duhaime avatar Mar 15 '22 00:03 duhaime

Ah, I think I misread your example. The string is literally <https://world.com​0x200b>, not <https://world.com​<0x200b>>. Is that right?

gjtorikian avatar Mar 15 '22 15:03 gjtorikian

Oh no, sorry, it should be exactly as it appears in the image above (the latter in your comment above).

duhaime avatar Mar 15 '22 16:03 duhaime

Strange!

What version of Ruby do you have running?

gjtorikian avatar Mar 15 '22 21:03 gjtorikian

2.6.8 via rbenv:

(base) % ruby -v
ruby 2.6.8p205 (2021-07-07 revision 67951) [arm64-darwin21]
(base) % which ruby
/Users/doug/.rbenv/shims/ruby

duhaime avatar Mar 16 '22 14:03 duhaime

I simply can't reproduce this, and neither can CI, which runs Ruby 2.6.6 on Windows/Ubuntu/macOS. I booted a Rails 7 app to test the logic in the console, and it worked fine, too.

Just to be extra explicit, this is the code I'm using to test:

    str = "hello: <https://world.com<0x200b>>"
    doc = CommonMarker.render_doc(str, :DEFAULT)

    doc.walk do |node|
      puts node.type
    end

A couple of things to note:

  1. GitHub's editor isn't stripping out that character, so I'm not sure why it is for you
  2. This still doesn't detect a link, and causes no "incompatible character encoding" errors.

I'm afraid without more information I'm not sure what I can do to solve this.

gjtorikian avatar Mar 16 '22 21:03 gjtorikian

Ah I think your example just needs to be updated. Your snippet has:

str = "hello: <https://world.com<0x200b>>"

In this case, your string contains the literal characters <0x200b> rather than the Unicode character that's causing the issue, so I think we just need to update the string you're using. As it turns out, the string I posted initially (s = "hello: <https://world.com​>") does contain the character; you should see it if you paste it into your Rails console:

[Image: Screen Shot 2022-03-16 at 7 14 24 PM]

You can see the codepoints of the string if you use s.unpack('U*') [and can combine the codepoints back into a string like so: s.unpack('U*').pack("U*")].
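As a short sketch of that unpack/pack round trip, the following lists code points as U+XXXX labels so that an invisible character becomes visible. codepoint_labels is a hypothetical helper name, and the sample string is shortened for readability.

```ruby
# Hypothetical helper: list a string's code points as U+XXXX labels.
def codepoint_labels(str)
  str.unpack("U*").map { |cp| format("U+%04X", cp) }
end

s = "com\u200B>" # ends with a zero-width space followed by ">"

p codepoint_labels(s)
# => ["U+0063", "U+006F", "U+006D", "U+200B", "U+003E"]

# Packing the code points again yields an equal UTF-8 string.
puts s.unpack("U*").pack("U*") == s # true
```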

Does this help you reproduce the situation?

duhaime avatar Mar 16 '22 23:03 duhaime

Got it. In Ruby, the convention is to use a \u escape to write a Unicode code point in hexadecimal:

irb(main):006:0> str = "hello: <https://world.com\u200b>"
=> "hello: <https://world.com​ >"

I can now reproduce the problem; now we're getting somewhere.
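To make the difference between the two spellings concrete, here is a small sketch in plain Ruby. zero_width_space? is a hypothetical helper, and the strings are shortened versions of the ones in this thread.

```ruby
# Hypothetical helper: does the string contain U+200B (zero-width space)?
def zero_width_space?(str)
  str.include?("\u200B")
end

literal = "world.com<0x200b>" # eight ordinary ASCII characters after "world.com"
escaped = "world.com\u200b"   # one invisible character after "world.com"

puts literal.ascii_only?         # true
puts escaped.ascii_only?         # false
puts escaped.length              # 10 characters
puts escaped.bytesize            # 12 bytes: U+200B is E2 80 8B in UTF-8
puts zero_width_space?(literal)  # false
puts zero_width_space?(escaped)  # true
```

Only the escaped form carries the bytes \xE2\x80\x8B that appear in the url shown in the failing output earlier in the thread.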

gjtorikian avatar Mar 17 '22 17:03 gjtorikian

Oh, and what's the code snippet for how you're rendering the string? CommonMarker.render_doc(str, :DEFAULT).to_html ?

gjtorikian avatar Mar 17 '22 20:03 gjtorikian

Yes, or CommonMarker.render_doc(str, :DEFAULT).to_plaintext

duhaime avatar Mar 17 '22 23:03 duhaime

@duhaime Can you try pointing the gem to the encodaroni branch? I believe this will have the fix, and if so, I will push out a new bug release.

gjtorikian avatar Mar 18 '22 19:03 gjtorikian

Hmm, the change looks good but I'm still getting the same error. This must be user error. Here's what I'm doing:

gem uninstall commonmarker
git clone https://github.com/gjtorikian/commonmarker
cd commonmarker && gem build commonmarker.gemspec
gem install commonmarker-0.23.4.gem
irb

Then in the irb console:

require 'commonmarker'

s = "hello: <https://world.com​>"
doc = CommonMarker.render_doc(s, :DEFAULT)

parsed = ""
doc.walk do |node|
  if node.type == :link
    text_node = node
    text_node = text_node.first_child until [:text, :code].include? text_node.type
    if node.url.include?(text_node.string_content)
      puts(node.url)
    end
  end
end

Which throws:

Traceback (most recent call last):
       13: from /Users/doug/.rbenv/versions/2.6.8/bin/irb:23:in `<main>'
       12: from /Users/doug/.rbenv/versions/2.6.8/bin/irb:23:in `load'
       11: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
       10: from (irb):7
        9: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:17:in `walk'
        8: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:72:in `each'
        7: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:18:in `block in walk'
        6: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:17:in `walk'
        5: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:72:in `each'
        4: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:18:in `block in walk'
        3: from /Users/doug/.rbenv/versions/2.6.8/lib/ruby/gems/2.6.0/gems/commonmarker-0.23.4/lib/commonmarker/node.rb:16:in `walk'
        2: from (irb):11:in `block in irb_binding'
        1: from (irb):11:in `include?'
Encoding::CompatibilityError (incompatible character encodings: ASCII-8BIT and UTF-8)

Should I be doing something differently to test?

duhaime avatar Mar 21 '22 12:03 duhaime

With the repo cloned, try:

  • script/bootstrap
  • bundle exec rake clean compile test

gjtorikian avatar Mar 21 '22 17:03 gjtorikian

Interesting, I ran those steps on a fresh rbenv env, and I still get the same result. Do you get a different result with the code block I posted above?

duhaime avatar Mar 21 '22 17:03 duhaime

> Interesting, I ran those steps on a fresh rbenv env, and I still get the same result. Do you get a different result with the code block I posted above?

Oh shoot, I do. Ok. I'll make time for this today.

gjtorikian avatar Mar 27 '22 14:03 gjtorikian

Due to https://github.com/gjtorikian/commonmarker/pull/186, walking over nodes has been removed in v1.0.0. Users can use https://github.com/gjtorikian/html-pipeline if they wish to iterate over HTML after the fact.

gjtorikian avatar Nov 03 '22 20:11 gjtorikian