TwitterEmbeddings
Some information on the tokenization of URLs, mentions and numbers; license for the vectors themselves?
Thanks a lot for compiling the word vectors and putting them online. That's a very nice resource for any tweet-based NLP. I'm currently experimenting with them for the Kaggle disaster tweet challenge. I have put the word2vec version online as a Kaggle Dataset: https://www.kaggle.com/herwinvw/twitter-word2vecs-wordvecs-from-godin, so they can be used easily in Kaggle notebooks. If you would like to maintain it there yourself, I would be happy to transfer ownership to you.
I still had a few questions about the word2vec dataset:
- Which license do you release it under? Is it under the same license as the code?
- Could you give a bit more information about the tokenization of the URLs, mentions and numbers? I reverse engineered that the tokens for these are _URL_, _MENTION_ and _NUMBER_. It would be nice to know the exact substitution though. For example, does _MENTION_ replace both the @ and the username? Does _NUMBER_ replace any sequence of digits (but not .,:)?
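
To make the second question concrete, here is a minimal Python sketch of the substitution I am currently assuming. The three regular expressions and the `normalize` function are my own guesses, not the preprocessing actually used for the vectors: I assume _MENTION_ replaces the @ together with the username, and _NUMBER_ replaces each run of digits while leaving . , : untouched.

```python
import re

# Hypothetical patterns: guesses at the preprocessing, not the confirmed rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")   # assumes the @ and the username are replaced together
NUMBER_RE = re.compile(r"\d+")     # assumes only digit runs are replaced, not . , :


def normalize(tweet: str) -> str:
    """Replace URLs, mentions and digit runs with placeholder tokens."""
    # URLs go first so that digits inside a link are not turned into _NUMBER_.
    tweet = URL_RE.sub("_URL_", tweet)
    tweet = MENTION_RE.sub("_MENTION_", tweet)
    tweet = NUMBER_RE.sub("_NUMBER_", tweet)
    return tweet


print(normalize("@bob check https://t.co/abc1 at 10:30"))
```

Under these assumptions a time such as 10:30 becomes two _NUMBER_ tokens separated by a colon. If the real preprocessing differs, for example by treating 10:30 or 3.5 as a single number, I would adapt my tokenizer accordingly.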