Character encoding issue with autolinking

Open whatupdave opened this issue 10 years ago • 16 comments

Not sure what's causing this:

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true).render('[email protected]ü')"
<p><a href="mailto:[email protected]%C3">[email protected]�</a>�</p>

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true).render('[email protected]ü').inspect"
"<p><a href=\"mailto:[email protected]%C3\">[email protected]\xC3</a>\xBC</p>\n"

It's fine without autolinking:

› ruby -e "require 'redcarpet'; puts Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: false).render('[email protected]ü')"
<p>[email protected]ü</p>

whatupdave avatar Jun 10 '14 17:06 whatupdave

Yup, I've just hit this same issue.

neilmiddleton avatar Sep 23 '14 22:09 neilmiddleton

I've hit this same issue, too.

It splits my UTF-8 character: the first byte ends up inside the link and the remaining bytes are left outside it.

For example, [email protected]\u300D is rendered as <a href="mailto:[email protected]%E3">mailto:[email protected]\xE3</a>\x80\x8D
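
A minimal Ruby sketch of why that split is a problem, using only core `String` methods (no redcarpet involved). It assumes the cut falls after the first byte of U+300D, as in the output above:

```ruby
# U+300D is one character but three bytes in UTF-8.
char = "\u300D"
bytes = char.bytes

# Cutting after the first byte leaves two fragments that are
# each invalid UTF-8 on their own.
head = char.byteslice(0, 1) # the byte that ends up inside the link
tail = char.byteslice(1, 2) # the bytes left outside the link

puts bytes.map { |b| format("%02X", b) }.join(" ") # E3 80 8D
puts head.valid_encoding?                          # false
puts tail.valid_encoding?                          # false
puts((head + tail) == char)                        # true
```

The two fragments only form a valid character again if nothing is inserted between them, which is exactly what the closing `</a>` tag prevents.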

david50407 avatar Feb 06 '15 18:02 david50407

I'm having the same issue as well. Any ideas of a fix for this @vmg?

ericgoodwin avatar Feb 06 '15 21:02 ericgoodwin

I think the problem is the same as https://github.com/vmg/redcarpet/pull/358

But why can a UTF-8 character be split...

david50407 avatar Feb 07 '15 06:02 david50407

I traced the code and extracted the sd_autolink__email function into my own test code, but there it works correctly.

It's so weird, because after copying the link into the buffer, sd_autolink__email calls the autolink callback and passes it the link.

So if sd_autolink__email were working correctly, the callback wouldn't receive the wrong link.

david50407 avatar Feb 07 '15 09:02 david50407

BTW, Rinku has the same issue. https://github.com/vmg/rinku

david50407 avatar Feb 07 '15 09:02 david50407

I found the problem here: https://github.com/vmg/redcarpet/blob/master/ext/redcarpet/autolink.c#L227

    for (link_end = 0; link_end < size; ++link_end) {
        uint8_t c = data[link_end];

        if (isalnum(c)) /* HERE */
            continue;

        if (c == '@')
            nb++;
        else if (c == '.' && link_end < size - 1)
            np++;
        else if (c != '-' && c != '_')
            break;
    }

When the bytes (\xE3\x80\x8D) are passed in, isalnum(0xE3) returns TRUE, so the first byte of the character is treated as part of the address.

When I changed the if statement to if (isalnum(c) && c < 0x7f), it works fine.
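
The effect of that change can be sketched with a small Ruby model of the loop above. This is an illustration only, not redcarpet's code: `scan_link_end`, `ASCII_ALNUM` and `LOOSE_ALNUM` are hypothetical names, the address is made up, and `LOOSE_ALNUM` simply assumes an isalnum() that accepts 0xE3 as reported.

```ruby
# Ruby model of the C scan loop quoted above (illustration, not redcarpet's code).
# The `alnum` argument stands in for isalnum().
def scan_link_end(data, alnum)
  bytes = data.bytes
  link_end = 0
  while link_end < bytes.size
    c = bytes[link_end]
    # '@', '-', '_' are always allowed; '.' only if it is not the last byte.
    allowed = c == 0x40 || c == 0x2D || c == 0x5F ||
              (c == 0x2E && link_end < bytes.size - 1)
    break unless alnum.call(c) || allowed
    link_end += 1
  end
  link_end
end

# ASCII-only check, intended to match `isalnum(c) && c < 0x7f`.
ASCII_ALNUM = lambda do |c|
  (0x30..0x39).cover?(c) || (0x41..0x5A).cover?(c) || (0x61..0x7A).cover?(c)
end

# Assumption: models an isalnum() that also accepts 0xE3, as reported above.
LOOSE_ALNUM = ->(c) { ASCII_ALNUM.call(c) || c == 0xE3 }

input = "user@example.com\u300D" # hypothetical address followed by U+300D

puts scan_link_end(input, LOOSE_ALNUM)      # 17: the lead byte 0xE3 is swallowed
puts scan_link_end(input, ASCII_ALNUM)      # 16: the scan stops before the character
puts input.byteslice(0, 17).valid_encoding? # false
puts input.byteslice(0, 16).valid_encoding? # true
```

With the ASCII-only predicate the scan ends on a character boundary, so the multi-byte character stays intact outside the link.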

david50407 avatar Feb 07 '15 14:02 david50407

Not sure if this is redcarpet related (or upstream-kramdown), but I have the same problem when a header contains a non-ASCII UTF-8 character:

# dupa
## dópa
redcarpet --render with_toc_data test.md
<h1 id="dupa">dupa</h1>
<h2 id="d�pa">dópa</h2>

When jekyll makes a build I get the following exception:

Liquid Exception: invalid byte sequence in UTF-8 in feed.xml
jekyll 2.4.0 | Error:  invalid byte sequence in UTF-8

Normally I'd use a urlify implementation like this one: https://github.com/beastaugh/urlify, but it seems that the escaping is done in C… well, I don't have the slightest idea how to debug it ;)

@vmg hope it helps somehow :)

ryrych avatar Mar 19 '16 00:03 ryrych

I'm getting invalid byte sequence in UTF-8 when trying to render markdown w/ redcarpet on the following char, but only if it's in the (bash) code block. Outside of the code block it works fine. The char is on the first line of the code block.

```bash
¢
```

MadPositron avatar Sep 06 '18 20:09 MadPositron

I'm still getting this issue when using autolinking. UTF-8 characters are being split apart when they appear after a piece of text that will be autolinked. For instance:

Email me at “[email protected]

Is going to cause problems. Is there a fix for this?

mdchaney avatar Oct 09 '19 05:10 mdchaney

@mdchaney patch is already here... https://github.com/vmg/redcarpet/pull/463

david50407 avatar Oct 13 '19 06:10 david50407

Okay, I'll just pull from the repo then. Are there plans for another release?

mdchaney avatar Oct 13 '19 21:10 mdchaney

I have no idea whether this repo is going to merge the patch or not. So, just apply the patch yourself. lol

david50407 avatar Oct 14 '19 10:10 david50407

Yeah, I realized that. Ugh. Looks like redcarpet has been abandoned - one of us should probably fork it and apply the outstanding merge requests. This particular one is a biggie.

mdchaney avatar Oct 14 '19 18:10 mdchaney

@vmg - Any chance of a fix for this? This one is biting me as well. This bug can be easily reproduced like this:

renderer = Redcarpet::Render::HTML.new(with_toc_data: true)
md = Redcarpet::Markdown.new(renderer, no_intra_emphasis: true, tables: true, autolink: true, quote: true)
md.render("“[email protected]“")

# => "<p>“<a href=\"mailto:[email protected]%E2\">[email protected]\xE2</a>\x80\x9C</p>\n"
# irb(main):008:0> md.render("“[email protected]“").valid_encoding?
# => false
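
Until a fix is released, one possible stopgap is to check the rendered string before passing it on. This is only a sketch: `ensure_valid_utf8` is a hypothetical helper, not part of redcarpet, and the input is the tail of the output above with the address as it appears in this thread. It uses the core `String#valid_encoding?` and `String#scrub` methods.

```ruby
# Tail of the rendered output reported above, including the split bytes.
html = "<a href=\"mailto:[email protected]%E2\">[email protected]\xE2</a>\x80\x9C</p>\n"

# Hypothetical helper: replace invalid byte sequences so later steps
# do not raise "invalid byte sequence in UTF-8".
def ensure_valid_utf8(str)
  str.valid_encoding? ? str : str.scrub("?")
end

safe = ensure_valid_utf8(html)

puts html.valid_encoding? # false
puts safe.valid_encoding? # true
```

This hides the symptom rather than fixing it: the split character is lost and the href still ends in a stray `%E2`.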

jstewart avatar Dec 30 '20 14:12 jstewart

Just checked why we are maintaining our own fork as well. @robin850 thanks for your last merges and releases. Do you see any chance to merge this one? Do you need any help?

fwolfst avatar Jul 19 '24 12:07 fwolfst