truncate_html does not respect Unicode
Hi @hgmnz,
A client is running some content with Unicode characters (namely, an up arrow) through truncate_html and noticing that those characters are disappearing.
I've narrowed it down to the scan in TruncateHtml::HtmlString. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.
You can paste this code into an .rb file and run it to see what I mean:
# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."
# From TruncateHtml::HtmlString
#
def regex
/(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end
# scan normally respects unicode.
puts unicode_string.scan(/.*/).join
# but this regex does not.
puts unicode_string.scan(regex).join
The result at the command line is
Up Arrow (↑) points up.
Up Arrow () points up.
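To see exactly which characters the scan discards, here is a minimal diagnostic sketch. The `TOKEN_REGEX` constant name and the `unmatched_pieces` helper are hypothetical and not part of the gem; the regex itself is copied from the snippet above. It relies on `String#split` returning the text between matches, which is the same text that `scan(...).join` throws away.

```ruby
# encoding: utf-8

# Regex copied verbatim from the snippet above (TruncateHtml::HtmlString).
TOKEN_REGEX = /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/

# Hypothetical helper (not part of the gem): returns the pieces of `str`
# that none of the regex alternatives match. String#split yields the text
# between matches; empty strings between adjacent matches are filtered out.
def unmatched_pieces(str, regex)
  str.split(regex).reject(&:empty?)
end

# Prints an array of whatever the tokenizer regex fails to match.
p unmatched_pieces("Up Arrow (↑) points up.", TOKEN_REGEX)
```

On a Ruby where the bug reproduces, the up arrow should show up in the printed array.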
Thanks!
I'm afraid it's going to take me a little while to describe the regex, but I'll take this as a bug report and try to fix it soon.
If you get to it sooner, please submit a pull request!
Thanks
OK, thanks!
I have the same problem using ruby 2.0.0-p0. It does not happen (to me) with ruby 1.9.3. Ruby 2.0 seems to use a new regexp engine, which probably isn't fully backward compatible. I replaced \w with \p{word} (in the regex method) and it looks like that solves this, but I'm not sure of the implications.
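As a rough illustration of the difference between the two classes: Ruby's Regexp documentation describes `\w` as `[a-zA-Z0-9_]`, while `\p{Word}` is the Unicode-aware word class. The `classify` helper below is hypothetical, written only for this comparison, and says nothing about why 1.9.3 and 2.0.0 behave differently with the gem's regex.

```ruby
# encoding: utf-8

# Hypothetical helper: reports whether a single character matches the
# ASCII-only \w class and the Unicode-aware \p{Word} class.
def classify(ch)
  {
    ascii_word:   !!(ch =~ /\w/),
    unicode_word: !!(ch =~ /\p{Word}/)
  }
end

# Compare a plain letter, an accented letter, a CJK character and a symbol.
["e", "é", "中", "↑"].each do |ch|
  puts "#{ch}: #{classify(ch)}"
end
```

Accented letters and CJK characters are word characters under `\p{Word}` but not under `\w`. The up arrow is a symbol, so neither class matches it.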
Oops. It seems this has been solved on master already :smiley: Thanks for the hard work.
Thanks for verifying @dmfrancisco :)
Sorry @hgmnz, I should have tested this better before commenting. My tests pass for Portuguese special characters, but I tested the original string provided by @adamflorin and it seems to fail. Example:
truncate_html "café ↑ périferôl"
# => "café périferôl"
In short, it seems the master branch fixes the issue for alphabets with special characters but not for Unicode symbols.
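One way to picture the letters-versus-symbols split is with Unicode general categories: the up arrow, currency signs and emoji fall under Symbol (`\p{S}`), not under any letter or word class. The sketch below uses a deliberately simplified stand-in for the tokenizer (the `SIMPLIFIED` and `WITH_SYMBOLS` names are hypothetical, and neither is the gem's real regex or the change on master) to show what adding a `\p{S}` alternative would do.

```ruby
# encoding: utf-8

# Simplified stand-in for the tokenizer, NOT the gem's actual regex:
# tags, runs of word characters, whitespace, single punctuation marks.
SIMPLIFIED = /<\/?[^>]+>|\p{Word}+|\s+|[[:punct:]]/

# Same alternatives plus one for the Unicode Symbol category
# (\p{S}: math, currency, modifier and other symbols, including emoji).
WITH_SYMBOLS = /<\/?[^>]+>|\p{Word}+|\s+|[[:punct:]]|\p{S}/

text = "café ↑ périferôl"

# Tokens from the simplified regex; symbols it cannot match are skipped.
puts text.scan(SIMPLIFIED).join

# With the extra alternative the arrow survives the scan/join round trip.
puts text.scan(WITH_SYMBOLS).join
```

This is only an assumption about one possible direction. Characters outside the listed categories would still be dropped, so a real fix may prefer a catch-all alternative.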
ahhh, thanks. Reopening this then
Aha, truncate_html filters out all Chinese characters, so this bug still exists.
Looks like it works on master but not in the released gem?
Is that the case? There don't seem to be any changes since 0.9.2 that would do that, but it could be accidental.
@hgmnz Yes, http://gurudigger.com/products/tuicool I use truncate_html to implement "More" on this page.
This is broken in version 0.9.2 of the gem.
I confirm: broken in version 0.9.2, and it works for me using the master branch. What about a new 0.9.3 gem? ;)
This is particularly painful in HTML use cases (e.g. truncating content from TinyMCE), where seemingly random spaces are dropped because the character is not respected.
The second space is the 2 byte character Unicode for
[34] pry(main)> truncate_html("what about this: ↑")
=> "what aboutthis:"
Using 0.9.2
I found this library that does not drop Unicode characters. https://github.com/nono/HTML-Truncator
Time for a beer!
This is still an issue: emoji disappear 😢
I confirm, version 0.9.3 removes Euro (€) and UK Pound Sterling (£) symbols.