twister-core
Split hashtags using utf8 characters
I've just noticed in the top hashtags that Chinese hashtags are not split at the Chinese equivalents of the comma (U+FF0C, \xEF\xBC\x8C in UTF-8) and the full stop (U+3002, \xE3\x80\x82 in UTF-8). Most likely hashtags should be extracted using any code point in Unicode's Punctuation and Separator categories as a break character. Does anybody know how Twitter and other social networks handle this?
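To make the proposal concrete, here is a rough sketch in Python (chosen for brevity, not taken from the twister-core code base). It treats any character whose Unicode general category starts with `P` (punctuation) or `Z` (separator) as the end of a hashtag. The names `is_break_char` and `extract_hashtags` are hypothetical, and keeping the underscore inside hashtags is an assumption rather than anything twister currently does.

```python
import unicodedata
from typing import List


def is_break_char(ch: str) -> bool:
    """Return True if ch should terminate a hashtag."""
    # Assumption: underscore (category Pc) stays part of the hashtag.
    if ch == "_":
        return False
    # Whitespace such as "\n" is category Cc, so check it separately.
    if ch.isspace():
        return True
    # General categories P* (punctuation) and Z* (separator) break the tag.
    return unicodedata.category(ch)[0] in ("P", "Z")


def extract_hashtags(text: str) -> List[str]:
    """Collect hashtag bodies (without the leading '#') from text."""
    tags = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#":
            j = i + 1
            # Advance until the first break character or the end of the text.
            while j < n and not is_break_char(text[j]):
                j += 1
            if j > i + 1:
                tags.append(text[i + 1:j])
            i = j
        else:
            i += 1
    return tags


if __name__ == "__main__":
    # The fullwidth comma and ideographic full stop both end the tags here.
    print(extract_hashtags("#你好，#世界。"))
```

Because `#` itself is in the punctuation category, a run such as `#a#b` comes out as two separate tags under this rule; whether that is desirable is a separate design question.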
With US$ 3.50 billion in cash, I doubt they wouldn't have noticed such a thing ;-)