ark-tweet-nlp
Use twitter-text to extract hashtags, mentions, and URLs
Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases. It also has a pretty good regex for matching URLs that aren't preceded by a protocol. Offloading the identification of the Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or at the very least, mean it's making the same mistakes as Twitter itself).
It would be great to see a diff of tokenization under twokenize's current rules, versus what it is when using twitter-text's rules.
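A rough sketch of how such a diff could be produced, using only the Java standard library. Everything named here is hypothetical: `TokenizationDiff`, `whitespaceTokens`, and `punctuationAwareTokens` are stand-ins for illustration, not twokenize's or twitter-text's actual rules. The idea is just to run two tokenizers over the same tweets and report only the tweets where the token lists differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TokenizationDiff {

    // Hypothetical stand-in for the "current" tokenizer: plain whitespace split.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Hypothetical stand-in for the "new" tokenizer: whitespace split, then
    // one trailing '.', ',', '!' or '?' is split off into its own token.
    static List<String> punctuationAwareTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : whitespaceTokens(text)) {
            int last = tok.length() - 1;
            if (tok.length() > 1 && ".,!?".indexOf(tok.charAt(last)) >= 0) {
                out.add(tok.substring(0, last));
                out.add(tok.substring(last));
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    // Returns three report lines (TWEET, BEFORE, AFTER) for every tweet
    // whose tokenization differs between the two tokenizers.
    static List<String> diff(List<String> tweets,
                             Function<String, List<String>> before,
                             Function<String, List<String>> after) {
        List<String> report = new ArrayList<>();
        for (String tweet : tweets) {
            List<String> a = before.apply(tweet);
            List<String> b = after.apply(tweet);
            if (!a.equals(b)) {
                report.add("TWEET  " + tweet);
                report.add("BEFORE " + String.join(" | ", a));
                report.add("AFTER  " + String.join(" | ", b));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "check out example.com/page.",
                "no difference here");
        List<String> report = diff(tweets,
                TokenizationDiff::whitespaceTokens,
                TokenizationDiff::punctuationAwareTokens);
        for (String line : report) {
            System.out.println(line);
        }
    }
}
```

To compare the real thing, the two stand-ins would be swapped for the existing tokenizer and a variant backed by twitter-text, then run over a sample of tweets. Since only differing tweets are reported, the output stays small enough to read by hand.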