ark-tweet-nlp

Use twitter-text to extract hashtags, mentions, and URLs

Open: jrnold opened this issue 6 years ago • 1 comment

Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases. It also has a pretty good regex for getting URLs that aren't preceded by a protocol. Offloading the identification of the Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or, at the very least, mean it's making the same mistakes as Twitter itself).

jrnold, Aug 24 '17 17:08

It would be great to see a diff of tokenization under twokenize's current rules, versus what it is when using twitter-text's rules.

brendano, Aug 27 '17 20:08
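
One way the requested comparison could be set up is sketched below: run two tokenizers over the same tweets and print a token-level diff for every tweet where they disagree. Both tokenizer functions are hypothetical stand-ins written for this example; neither reproduces twokenize's rules or calls twitter-text, and `tokenization_diff` is an illustrative name, not something from either project. In practice each stand-in would be swapped for a wrapper around the real tokenizer.

```python
import difflib
import re


# Hypothetical stand-in for the current rules: split on whitespace only.
# This is NOT twokenize; it only gives the diff something to compare.
def tokenize_current(text):
    return text.split()


# Hypothetical stand-in for the candidate rules: additionally split trailing
# punctuation off each token. This is NOT twitter-text.
def tokenize_candidate(text):
    tokens = []
    for chunk in text.split():
        match = re.fullmatch(r"(.*?)([.,!?]+)", chunk)
        if match and match.group(1):
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(chunk)
    return tokens


def tokenization_diff(tweets, old, new):
    """Return (tweet, diff_lines) for each tweet the two tokenizers split differently."""
    changed = []
    for tweet in tweets:
        old_tokens, new_tokens = old(tweet), new(tweet)
        if old_tokens != new_tokens:
            # One token per line, so the diff shows exactly which tokens changed.
            diff = list(difflib.unified_diff(
                old_tokens, new_tokens, "current", "candidate", n=0, lineterm=""))
            changed.append((tweet, diff))
    return changed


if __name__ == "__main__":
    sample = ["check #nlp at example.com.", "hello world"]
    for tweet, diff in tokenization_diff(sample, tokenize_current, tokenize_candidate):
        print(tweet)
        print("\n".join(diff))
```

Diffing token lists rather than re-joined strings keeps the output readable when only one or two tokens in a tweet change, which is the likely case for hashtags, mentions, and URLs.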