twister-core

Split hashtags using utf8 characters

Open • dryabov opened this issue 8 years ago • 1 comment

I've just noticed in the top hashtags that Chinese hashtags are not split at the Chinese equivalents of the comma (U+FF0C, \xEF\xBC\x8C in UTF-8) and the full stop (U+3002, \xE3\x80\x82 in UTF-8). Most likely hashtags should be extracted using any code point in Unicode's Punctuation and Separator categories as a break character. Does anybody know how Twitter and other social networks handle this?
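To illustrate the proposal, here is a minimal Python sketch (not twister-core's actual code, which is C++) that ends a hashtag at the first code point whose Unicode general category is Punctuation (`P*`) or Separator (`Z*`). The function names `is_break` and `extract_hashtags` are hypothetical, and two choices are assumptions rather than anything settled in this issue: control characters such as newline and tab also end a tag, and the underscore is kept as part of the tag even though it falls in the `Pc` category.

```python
import unicodedata


def is_break(ch):
    """Return True if ch should terminate a hashtag (hypothetical helper)."""
    if ch == "_":
        # Assumption: keep underscores inside tags, although "_" is category Pc.
        return False
    cat = unicodedata.category(ch)
    # P* = punctuation, Z* = separators; Cc covers newline/tab (assumption).
    return cat[0] in ("P", "Z") or cat == "Cc"


def extract_hashtags(text):
    """Collect hashtags from text, without the leading '#'."""
    tags = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#":
            j = i + 1
            # Advance until a break character or the end of the text.
            while j < n and not is_break(text[j]):
                j += 1
            if j > i + 1:  # skip a bare '#'
                tags.append(text[i + 1:j])
            i = j
        else:
            i += 1
    return tags


print(extract_hashtags("#你好，世界。 #twister"))
```

Because U+FF0C and U+3002 both have general category `Po`, the first tag here stops before the fullwidth comma instead of swallowing the rest of the sentence. Note that `#` itself is also `Po`, so adjacent tags such as `#a#b` are split into two.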

dryabov • Mar 27 '16 19:03

With US$ 3.50 billion in cash, I doubt they wouldn't have noticed such a thing ;-)

miguelfreitas • Mar 27 '16 21:03