truncate_html does not respect Unicode

Open adamflorin opened this issue 12 years ago • 17 comments

Hi @hgmnz,

A client is running some content with Unicode characters (namely, an up arrow) through truncate_html and noticing that those characters are disappearing.

I've narrowed it down to the scan in TruncateHtml::HtmlString. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.

You can paste this code into an .rb file and run it to see what I mean:

# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."

# From TruncateHtml::HtmlString
# 
def regex
  /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end

# scan normally respects unicode.
puts unicode_string.scan(/.*/).join

# but this regex does not.
puts unicode_string.scan(regex).join

The result at the command line is

Up Arrow (↑) points up.
Up Arrow () points up.
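
One way to see why the arrow falls through is to test it against each character class the regex relies on. The following is only a rough sketch: `CLASSES` and `matching_classes` are hypothetical names, not part of truncate_html, and `\p{S}` (the Unicode Symbol category) is included purely for comparison, since it does not appear in the regex above.

```ruby
# encoding: utf-8
# Hypothetical diagnostic, not part of truncate_html: lists which
# character classes can match a given single character.
CLASSES = {
  "[[:alpha:]]" => /[[:alpha:]]/,
  "\\w"         => /\w/,
  "\\s"         => /\s/,
  "[[:punct:]]" => /[[:punct:]]/,
  "\\p{S}"      => /\p{S}/ # Unicode symbols; NOT used by the original regex
}

# Returns the names of every class in CLASSES that matches +char+.
def matching_classes(char)
  CLASSES.select { |_name, pattern| char.match?(pattern) }.keys
end

["U", "(", " ", "\u2191"].each do |char|
  puts "#{char.inspect}: #{matching_classes(char).inspect}"
end
```

If the arrow is only reported under `\p{S}`, then no alternative in the regex can consume it, and `scan` silently skips it.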

Thanks!

adamflorin avatar Feb 13 '13 02:02 adamflorin

I'm afraid it's going to take me a little while to describe the regex, but I'll take this as a bug report and try to fix it soon.

If you get to it sooner, please submit a pull request!

Thanks

hgmnz avatar Feb 13 '13 19:02 hgmnz

OK, thanks!

adamflorin avatar Feb 13 '13 19:02 adamflorin

I have the same problem using Ruby 2.0.0-p0. It does not happen (to me) with Ruby 1.9.3. It seems Ruby 2.0 uses a new regexp engine, which probably isn't fully backward compatible. I replaced \w with \p{word} (in the regex method) and it looks like that solves this, but I'm not sure of the implications.
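
For reference, a small sketch of how `\w` and `\p{Word}` differ in current Ruby, where `\w` is ASCII-only. The two patterns here are stand-ins for illustration, not the regex from the gem, and the constant names are made up.

```ruby
# encoding: utf-8
# Illustration only: stand-in patterns, not the gem's regex.
ASCII_WORD   = /\w+/       # \w is ASCII-only in Ruby: [a-zA-Z0-9_]
UNICODE_WORD = /\p{Word}+/ # Unicode letters, marks, numbers, connector punctuation

cafe  = "caf\u00e9" # "café" with a precomposed é
arrow = "\u2191"    # the up arrow from the original report

p cafe.scan(ASCII_WORD)    # the accented letter is left out of the match
p cafe.scan(UNICODE_WORD)  # the whole word matches
p arrow.scan(UNICODE_WORD) # a symbol is not a word character either way
```

So swapping in `\p{Word}` would be expected to help with accented letters, but not with symbols such as arrows.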

dmfrancisco avatar Mar 30 '13 19:03 dmfrancisco

Oops. It seems this has been solved on master already :smiley: Thanks for the hard work.

dmfrancisco avatar Mar 31 '13 13:03 dmfrancisco

Thanks for verifying @dmfrancisco :)

hgmnz avatar Mar 31 '13 16:03 hgmnz

Sorry @hgmnz, I should have tested this better before commenting. My tests pass for Portuguese special characters, but I tested the original string provided by @adamflorin and it seems to fail. Example:

truncate_html "café ↑ périferôl"
# => "café  périferôl"

In short, it seems the master branch fixes the issue for alphabets with special characters but not for Unicode symbols.
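
A minimal sketch of one possible direction, assuming the tokenizer works by joining the results of `scan`: add a catch-all alternative so that every character ends up in some token. Both regexes below are simplified stand-ins with made-up names, not the gem's actual regex or its fix.

```ruby
# encoding: utf-8
# Simplified stand-ins for the tokenizer regex (assumption: NOT the gem's own).
# LOSSY_TOKENS has no alternative that can match a Unicode symbol.
LOSSY_TOKENS    = /<\/?[^>]+>|[[:alnum:]_]+|\s+|[[:punct:]]/
# LOSSLESS_TOKENS adds a final catch-all; /m lets "." match newlines too.
LOSSLESS_TOKENS = /<\/?[^>]+>|[[:alnum:]_]+|\s+|[[:punct:]]|./m

text = "caf\u00e9 \u2191 p\u00e9riferol costs \u20ac5"

puts text.scan(LOSSY_TOKENS).join            # symbols are missing
puts text.scan(LOSSLESS_TOKENS).join == text # round-trips unchanged
```

The catch-all emits each unmatched character as its own token, which may or may not suit the length counting that truncation needs.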

dmfrancisco avatar Mar 31 '13 16:03 dmfrancisco

ahhh, thanks. Reopening this then

hgmnz avatar Mar 31 '13 18:03 hgmnz

Aha, truncate_html filters out all Chinese Unicode characters, so this bug still exists.

halida avatar Jun 05 '13 12:06 halida

Looks like it works on master, but not in the released gem?

halida avatar Jun 06 '13 00:06 halida

Looks like it works on master, but not in the released gem?

Is that the case? There don't seem to be any changes since 0.9.2 that would do that, but it could be accidental.

hgmnz avatar Jun 06 '13 03:06 hgmnz

@hgmnz Yes. I use truncate_html to implement "More" on this page: http://gurudigger.com/products/tuicool

halida avatar Jun 06 '13 03:06 halida

This is broken in version 0.9.2 of the gem.

alex94040 avatar Oct 15 '13 23:10 alex94040

I confirm: broken in version 0.9.2, and it works for me using the master branch. What about a new 0.9.3 gem? ;)

afriqs avatar Nov 07 '13 10:11 afriqs

This is particularly painful in HTML use-cases (i.e. truncating stuff from TinyMCE) where random spaces are dropped because the &nbsp; character is not respected.

The second space is the 2-byte Unicode character for &nbsp; (U+00A0).

[34] pry(main)> truncate_html("what about this: ↑")
=> "what aboutthis:"

Using 0.9.2
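
A hedged sketch of the non-breaking space case: in Ruby, `\s` only covers ASCII whitespace, while `[[:space:]]` is Unicode-aware. The two tokenizer regexes are simplified stand-ins with hypothetical names, not the gem's regex.

```ruby
# encoding: utf-8
nbsp = "\u00A0" # the character that &nbsp; decodes to
text = "what about#{nbsp}this"

# Stand-in tokenizers (assumptions, not the gem's regex).
ASCII_SPACE_TOKENS   = /[[:alnum:]]+|\s+/
UNICODE_SPACE_TOKENS = /[[:alnum:]]+|[[:space:]]+/

puts nbsp.bytesize              # size in bytes when encoded as UTF-8
puts nbsp.match?(/\s/)          # ASCII-only whitespace class
puts nbsp.match?(/[[:space:]]/) # Unicode-aware whitespace class
puts text.scan(ASCII_SPACE_TOKENS).join
puts text.scan(UNICODE_SPACE_TOKENS).join == text
```

With the ASCII-only class the non-breaking space is skipped and the neighbouring words run together, much like the output reported above.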

aguynamedben avatar Jul 17 '14 00:07 aguynamedben

I found this library that does not drop Unicode characters. https://github.com/nono/HTML-Truncator

Time for a beer!

aguynamedben avatar Jul 17 '14 01:07 aguynamedben

This is still an issue: emoji disappear 😢

lachlanjc avatar Nov 23 '15 04:11 lachlanjc

I confirm, version 0.9.3 removes Euro (€) and UK Pound Sterling (£) symbols.

togiberlin avatar Feb 05 '16 09:02 togiberlin