commons-text Windows-1252 encoding for HTML numeric entities

Open ifly6 opened this issue 4 years ago • 3 comments

Currently, if given text like so:

"arma virumque cano&#133;"
"&#147;bread and circuses&#148;"

StringEscapeUtils returns the Unicode characters at those code points (128, 133, 147, 148, and so on), which are a bunch of obscure, basically-never-used control characters that display as spaces. In Windows-1252, however, those values are in common use: 128 is €, 133 is …, and 147 and 148 are the curly quotes “ and ” (™, at 153, is another example).

I've changed NumericEntityUnescaper to treat HTML numeric entities whose values are valid CP-1252 code points between 128 and 159 (inclusive) as CP-1252 characters, and to decode them to the corresponding punctuation marks and symbols instead of the obscure Unicode control characters.
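
To illustrate the remapping described above, here is a minimal standalone sketch. It is not the code from this change and it does not touch NumericEntityUnescaper; the class name Cp1252EntitySketch and its methods remap and unescape are hypothetical names made up for this example. It assumes that only decimal entities are handled and that the five positions CP-1252 leaves undefined (129, 141, 143, 144, 157) are passed through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of commons-text. */
final class Cp1252EntitySketch {

    // Unicode code points for Windows-1252 values 0x80..0x9F.
    // -1 marks the five positions that CP-1252 leaves undefined.
    private static final int[] CP1252_HIGH = {
        0x20AC, -1,     0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80..0x87
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, -1,     0x017D, -1,     // 0x88..0x8F
        -1,     0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90..0x97
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, -1,     0x017E, 0x0178  // 0x98..0x9F
    };

    // Decimal numeric entities only, e.g. "&#133;". At most 7 digits so parseInt cannot overflow.
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    private Cp1252EntitySketch() { }

    /** Maps 128..159 to the CP-1252 character where one is defined; returns any other value unchanged. */
    static int remap(int codePoint) {
        if (codePoint >= 0x80 && codePoint <= 0x9F) {
            int mapped = CP1252_HIGH[codePoint - 0x80];
            return mapped == -1 ? codePoint : mapped;
        }
        return codePoint;
    }

    /** Replaces decimal numeric entities in the input, applying the CP-1252 remapping. */
    static String unescape(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int value = Integer.parseInt(m.group(1));
            if (Character.isValidCodePoint(value)) {
                out.appendCodePoint(remap(value));
            } else {
                out.append(m.group()); // out of Unicode range: leave the entity as written
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With this sketch, Cp1252EntitySketch.unescape("&#147;bread and circuses&#148;") yields the text wrapped in U+201C and U+201D rather than in the C1 control characters U+0093 and U+0094.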

Dec 13 '20 20:12 ifly6

There is a test failing... :-(

Dec 13 '20 20:12 garydgregory

Coverage decreased (-0.03%) to 98.654% when pulling f5c12c993ec867b0fc9d2ebb0b9a70c818112738 on ifly6:cp1252 into fa366c88e3f70c367dd736b6fe2e38b7b66eddb3 on apache:master.

Dec 13 '20 20:12 coveralls
