readabilitySAX
i18n: expand full stop and comma code points
The Readability scoring algorithm should work with most languages, but the regular expressions in the current implementation only handle basic ASCII punctuation.
I've expanded the code points to cover all full stops and commas specified by Unicode (listed below).
In addition, there is one special case in the full stop list: 0x3002 IDEOGRAPHIC FULL STOP. This punctuation mark, rendered as 。, is the standard full stop in Chinese and is widely used across CJK regions. It does not require trailing whitespace, so I've slightly modified re_sentence as well.
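To illustrate the special case described above, here is a minimal sketch of a sentence-end pattern. It is not the actual re_sentence from this change: the names SPACED_STOPS, NO_SPACE_STOPS, reSentenceSketch, and hasSentenceEnd are hypothetical, and only a few code points are included for brevity. The assumption is that most full stops count as a sentence end only when followed by whitespace or the end of the text, while 0x3002 counts on its own.

```javascript
// Hypothetical sketch, not the project's actual re_sentence.
// Full stops that are expected to be followed by whitespace or end of text
// (a small subset of the full list, for brevity).
const SPACED_STOPS = "\\u002E\\u0589\\u06D4";
// IDEOGRAPHIC FULL STOP needs no trailing whitespace.
const NO_SPACE_STOPS = "\\u3002";

const reSentenceSketch = new RegExp(
  "[" + SPACED_STOPS + "](?:\\s|$)|[" + NO_SPACE_STOPS + "]"
);

// Returns true if the text contains at least one sentence end.
function hasSentenceEnd(text) {
  return reSentenceSketch.test(text);
}
```

With this pattern, the "." in "version 1.2" is not treated as a sentence end, while "这是一句话。下一句" matches even though nothing separates the two sentences.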
Full stops
0x002E FULL STOP
0x0589 ARMENIAN FULL STOP
0x06D4 ARABIC FULL STOP
0x0701 SYRIAC SUPRALINEAR FULL STOP
0x0702 SYRIAC SUBLINEAR FULL STOP
0x1362 ETHIOPIC FULL STOP
0x166E CANADIAN SYLLABICS FULL STOP
0x1803 MONGOLIAN FULL STOP
0x1809 MONGOLIAN MANCHU FULL STOP
0x2CF9 COPTIC OLD NUBIAN FULL STOP
0x2CFE COPTIC FULL STOP
0x2E3C STENOGRAPHIC FULL STOP
0x3002 IDEOGRAPHIC FULL STOP
0xA4FF LISU PUNCTUATION FULL STOP
0xA60E VAI FULL STOP
0xA6F3 BAMUM FULL STOP
0xFE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
0xFE52 SMALL FULL STOP
0xFF0E FULLWIDTH FULL STOP
0xFF61 HALFWIDTH IDEOGRAPHIC FULL STOP
Commas
0x002C COMMA
0x055D ARMENIAN COMMA
0x060C ARABIC COMMA
0x07F8 NKO COMMA
0x1363 ETHIOPIC COMMA
0x1802 MONGOLIAN COMMA
0x1808 MONGOLIAN MANCHU COMMA
0x3001 IDEOGRAPHIC COMMA
0xA4FE LISU PUNCTUATION COMMA
0xA60D VAI COMMA
0xA6F5 BAMUM COMMA
0xFE10 PRESENTATION FORM FOR VERTICAL COMMA
0xFE11 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
0xFE50 SMALL COMMA
0xFE51 SMALL IDEOGRAPHIC COMMA
0xFF0C FULLWIDTH COMMA
0xFF64 HALFWIDTH IDEOGRAPHIC COMMA
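The lists above can be turned into a regular expression character class mechanically. The following is a rough sketch under that assumption, not the code from this change: COMMA_CODE_POINTS, toCharClass, and countCommas are hypothetical names, and the same helper would work for the full stop list. All listed code points are in the Basic Multilingual Plane, so four-digit \u escapes are sufficient.

```javascript
// Hypothetical sketch: build a comma-matching character class from the list.
const COMMA_CODE_POINTS = [
  0x002C, 0x055D, 0x060C, 0x07F8, 0x1363, 0x1802, 0x1808, 0x3001,
  0xA4FE, 0xA60D, 0xA6F5, 0xFE10, 0xFE11, 0xFE50, 0xFE51, 0xFF0C, 0xFF64
];

// Converts BMP code points into a class such as "[\u002C\u055D...]".
function toCharClass(codePoints) {
  const escapes = codePoints.map(function (cp) {
    return "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");
  });
  return "[" + escapes.join("") + "]";
}

const reCommasSketch = new RegExp(toCharClass(COMMA_CODE_POINTS), "g");

// Counts every comma-like character in the text.
function countCommas(text) {
  const matches = text.match(reCommasSketch);
  return matches ? matches.length : 0;
}
```

For example, countCommas("a, b،c、d，e") counts the ASCII, Arabic, ideographic, and fullwidth commas alike.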