Windows-1252 encoding for HTML numeric entities

Open ifly6 opened this issue 4 years ago • 3 comments

Currently, if given text like so:

"arma virumque cano&#133;"
"&#147;bread and circuses&#148;"

StringEscapeUtils returns the corresponding Unicode characters for points 128, 133, 147, and 148, which are a bunch of obscure, basically-never-used control characters that display as spaces. Code points in that range are, however, far more commonly used with their Windows-1252 meanings, where they correspond to characters like € and ™.

I've changed NumericEntityUnescaper to treat HTML numeric entities between 128 and 159 (inclusive) that correspond to valid CP-1252 code points as CP-1252 characters, decoding them to the corresponding punctuation marks etc. instead of the obscure Unicode control characters.
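
For anyone skimming, here is a rough standalone sketch of the remapping idea. It is not the code in this PR and does not touch NumericEntityUnescaper: the class name `Cp1252EntitySketch` and the methods `remap` and `unescapeDecimal` are made up for illustration, and it only handles decimal entities of the form `&#NNN;`. The five positions that CP-1252 leaves undefined (129, 141, 143, 144, 157) are assumed to fall through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of commons-text.
final class Cp1252EntitySketch {

    // CP-1252 characters for byte values 0x80..0x9F.
    // A 0 entry marks a position that CP-1252 leaves undefined.
    private static final char[] C1_TO_CP1252 = {
        '\u20AC', 0,        '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
        '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', 0,        '\u017D', 0,
        0,        '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
        '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', 0,        '\u017E', '\u0178'
    };

    // Decimal entities only; up to 7 digits so parseInt cannot overflow.
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Map 128..159 to the CP-1252 character when one is defined;
    // every other value is returned unchanged.
    static int remap(int codePoint) {
        if (codePoint >= 128 && codePoint <= 159) {
            char mapped = C1_TO_CP1252[codePoint - 128];
            if (mapped != 0) {
                return mapped;
            }
        }
        return codePoint;
    }

    // Replace each decimal entity with its (remapped) character.
    // Entities outside the valid Unicode range are left as written.
    static String unescapeDecimal(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int cp = remap(Integer.parseInt(m.group(1)));
            if (Character.isValidCodePoint(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Print code points in hex to avoid depending on console encoding.
        for (int cp : new int[] {128, 133, 147, 148, 129}) {
            System.out.printf("%d -> U+%04X%n", cp, remap(cp));
        }
    }
}
```

With this sketch, `unescapeDecimal("&#147;bread and circuses&#148;")` yields the phrase wrapped in curly quotes (U+201C, U+201D) rather than in C1 control characters, while `&#129;` still decodes to U+0081 because CP-1252 defines nothing there.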

ifly6, Dec 13 '20 20:12

There is a test failing... :-(

garydgregory, Dec 13 '20 20:12

Coverage decreased (-0.03%) to 98.654% when pulling f5c12c993ec867b0fc9d2ebb0b9a70c818112738 on ifly6:cp1252 into fa366c88e3f70c367dd736b6fe2e38b7b66eddb3 on apache:master.

coveralls, Dec 13 '20 20:12