truncate_html does not respect Unicode

Hi @hgmnz,

A client is running some content with Unicode characters (namely, an up arrow) through truncate_html and noticing that those characters are disappearing.

I've narrowed it down to the scan in TruncateHtml::HtmlString. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.

You can paste this code into an .rb file and run it to see what I mean:

# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."

# From TruncateHtml::HtmlString
# 
def regex
  /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end

# scan normally respects unicode.
puts unicode_string.scan(/.*/).join

# but this regex does not.
puts unicode_string.scan(regex).join

The result at the command line is

Up Arrow (↑) points up.
Up Arrow () points up.
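
One way to see exactly which characters fall through the pattern is to delete everything it matches and look at what remains. The sketch below assumes the regex is the one quoted above; `dropped_chars` is a hypothetical helper written for this illustration and is not part of truncate_html.

```ruby
# encoding: utf-8

# Pattern copied from the snippet above (TruncateHtml::HtmlString).
HTML_STRING_REGEX = /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/

# Hypothetical helper: remove every match and return the distinct
# characters that none of the alternatives consumed.
def dropped_chars(string, regex)
  string.gsub(regex, "").chars.uniq
end

p dropped_chars("Up Arrow (↑) points up.", HTML_STRING_REGEX)
```

Anything this prints is a character that scan skips over, which is why it vanishes once the tokens are joined back together.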

Thanks!

Feb 13 '13 02:02 adamflorin

It's going to take me a little while to describe the regex, I'm afraid, but I'll take this as a bug report and try to fix it soon.

If you get to it sooner, please submit a pull request!

Thanks

Feb 13 '13 19:02 hgmnz

OK, thanks!

Feb 13 '13 19:02 adamflorin

I have the same problem using Ruby 2.0.0-p0. It does not happen (for me) with Ruby 1.9.3. Ruby 2.0 seems to use a new regexp engine, which probably isn't fully backward compatible. I replaced \w with \p{word} (in the regex method) and it looks like that solves this, but I'm not sure of the implications.
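
To make the \w versus \p{word} difference concrete, here is a small sketch that checks a few characters against the classes the pattern relies on. `classify` is a hypothetical helper for this illustration only, and the sketch does not try to explain why Ruby 1.9.3 and 2.0 behave differently.

```ruby
# encoding: utf-8

# Hypothetical helper: report which character classes match a single character.
def classify(char)
  {
    "\\w"         => char.match?(/\w/),
    "\\p{Word}"   => char.match?(/\p{Word}/),
    "[[:alpha:]]" => char.match?(/[[:alpha:]]/),
    "[[:punct:]]" => char.match?(/[[:punct:]]/),
    "\\p{S}"      => char.match?(/\p{S}/)
  }
end

# An ASCII letter, an accented letter, and the arrow from the report.
["a", "é", "↑"].each { |c| puts "#{c.inspect}: #{classify(c)}" }
```

In Ruby, \w only covers ASCII word characters while \p{Word} also covers Unicode letters, marks and numbers, so the swap helps with accented letters. A symbol such as the arrow belongs to neither class; it falls under \p{S}.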

Mar 30 '13 19:03 dmfrancisco

Oops. It seems this has been solved on master already :smiley: Thanks for the hard work.

Mar 31 '13 13:03 dmfrancisco

Thanks for verifying @dmfrancisco :)

Mar 31 '13 16:03 hgmnz

Sorry @hgmnz, I should have tested this better before commenting. My tests pass for Portuguese special characters, but when I tested the original string provided by @adamflorin, it seems to fail. Example:

truncate_html "café ↑ périferôl"
# => "café  périferôl"

In short, it seems the master branch fixes the issue for alphabets with special characters but not for Unicode symbols.
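
One possible direction, sketched here under the assumption that the tokenizer can treat an otherwise unmatched character as its own token, is to append a catch-all alternative to the pattern. This is not the change made on master, and its effect on how truncate_html counts words has not been checked; `ORIGINAL_REGEX` and `PATCHED_REGEX` are names chosen for this illustration.

```ruby
# encoding: utf-8

# Pattern quoted earlier in this thread (TruncateHtml::HtmlString).
ORIGINAL_REGEX = /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/

# Assumption: a final catch-all alternative. Any single character the
# original alternatives do not match becomes a one-character token.
PATCHED_REGEX = Regexp.union(ORIGINAL_REGEX, /./m)

input  = "café ↑ périferôl"
tokens = input.scan(PATCHED_REGEX)

p tokens
puts tokens.join == input # the round trip no longer loses characters
```

Because the catch-all comes last, the original alternatives still take precedence; it only picks up whatever they leave behind, such as arrows, currency symbols or emoji.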

Mar 31 '13 16:03 dmfrancisco

ahhh, thanks. Reopening this then

Mar 31 '13 18:03 hgmnz

Aha, truncate_html filters out all the Chinese Unicode words, so this bug still exists.

Jun 05 '13 12:06 halida

Looks like it works on master, but not in the gem?

Jun 06 '13 00:06 halida

Looks like it works on master, but not in the gem?

Is that the case? There don't seem to be any changes since 0.9.2 that would do that, but it could be accidental.

Jun 06 '13 03:06 hgmnz

@hgmnz Yes, see http://gurudigger.com/products/tuicool. I use truncate_html to implement "More" on this page.

Jun 06 '13 03:06 halida

This is broken in version 0.9.2 of the gem.

Oct 15 '13 23:10 alex94040

I confirm: broken in version 0.9.2, and works for me using the master branch. What about a new 0.9.3 gem? ;)

Nov 07 '13 10:11 afriqs

This is particularly painful in HTML use cases (e.g. truncating content from TinyMCE), where random spaces are dropped because the &nbsp; character is not respected.

The second space in the input below is the 2-byte Unicode character for &nbsp; (U+00A0).

[34] pry(main)> truncate_html("what about this: ↑")
=> "what aboutthis:"

Using 0.9.2

Jul 17 '14 00:07 aguynamedben

I found this library that does not drop Unicode characters. https://github.com/nono/HTML-Truncator

Time for a beer!

Jul 17 '14 01:07 aguynamedben

This is still an issue: emoji disappear 😢

Nov 23 '15 04:11 lachlanjc

I confirm, version 0.9.3 removes Euro (€) and UK Pound Sterling (£) symbols.

Feb 05 '16 09:02 togiberlin