commons-text
Windows-1252 encoding for HTML numeric entities
Currently, if given text like so:
"arma virumque cano&#133;"
"&#147;bread and circuses&#148;"
StringEscapeUtils returns the Unicode characters at code points 128, 133, 147, and 148, which are obscure, basically never used C1 control characters that display as spaces. In the Windows-1252 encoding, however, values in this range are in common use and correspond to characters like € and ™.
I've changed NumericEntityUnescaper to treat HTML numeric entities whose values are valid CP-1252 code points between 128 and 159 (inclusive) as CP-1252 characters, decoding them to the corresponding punctuation marks and symbols instead of the obscure Unicode control characters.
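For reviewers, here is a rough, standalone sketch of the remapping idea. It is not the code in this change: the class Cp1252EntitySketch and its methods remap and unescapeDecimal are hypothetical names, it handles only decimal entities with a trailing semicolon, and it assumes the five positions Windows-1252 leaves undefined (129, 141, 143, 144, 157) should pass through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual NumericEntityUnescaper change.
final class Cp1252EntitySketch {

    // Windows-1252 mappings for values 128..159.
    // 0 marks the five positions the encoding leaves undefined.
    private static final int[] CP1252 = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
    };

    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Map 128..159 to the CP-1252 character when one is defined;
    // every other value (and the undefined positions) is returned as-is.
    static int remap(int codePoint) {
        if (codePoint >= 128 && codePoint <= 159) {
            int mapped = CP1252[codePoint - 128];
            if (mapped != 0) {
                return mapped;
            }
        }
        return codePoint;
    }

    // Replace decimal numeric entities such as &#133; with their characters.
    static String unescapeDecimal(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int cp = Integer.parseInt(m.group(1));
            if (Character.isValidCodePoint(cp)) {
                out.appendCodePoint(remap(cp));
            } else {
                out.append(m.group()); // leave out-of-range entities untouched
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With this sketch, the entity 133 decodes to an ellipsis (U+2026) and 147/148 decode to curly double quotes (U+201C/U+201D), while values outside 128 to 159 are unaffected.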
There is a test failing... :-(