Some information on the tokenization of URLs, mentions and numbers; license for the vectors themselves?

Open herwinvw opened this issue 5 years ago • 0 comments

Thanks a lot for composing the word vectors and putting them online. They are a very nice resource for any tweet-based NLP. I'm currently experimenting with them for the Kaggle disaster tweet challenge. I have put the word2vec version online as a Kaggle Dataset (https://www.kaggle.com/herwinvw/twitter-word2vecs-wordvecs-from-godin) so that it can be used easily in Kaggle notebooks. If you would like to maintain it there yourself, I would be happy to transfer ownership to you.

I still have a few questions about the word2vec dataset:

Which license do you release the vectors under? Is it the same license as the code?
Could you give a bit more information about the tokenization of URLs, mentions and numbers? I reverse engineered that the tokens for these are _URL_, _MENTION_ and _NUMBER_, but it would be nice to know the exact substitution rules. For example, does _MENTION_ replace both the @ and the username? Does _NUMBER_ replace any sequence of digits, but not separators such as '.', ',' or ':'?
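
To make the second question concrete, here is a minimal sketch of the substitution I am guessing at. The three regular expressions and the `normalize_tweet` helper are my own assumptions, not the preprocessing actually used to train the vectors; only the three placeholder tokens come from inspecting the vocabulary.

```python
import re

# Assumed patterns: the real preprocessing rules are exactly what is being asked about.
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # guess: a whole URL becomes one token
MENTION_RE = re.compile(r"@\w+")               # guess: the @ and the username are replaced together
NUMBER_RE = re.compile(r"\d+")                 # guess: digit runs only, separators are kept


def normalize_tweet(text):
    """Replace URLs, mentions and digit runs with placeholder tokens (hypothetical rules)."""
    # URLs and mentions go first so that digits inside them are not rewritten separately.
    text = URL_RE.sub("_URL_", text)
    text = MENTION_RE.sub("_MENTION_", text)
    text = NUMBER_RE.sub("_NUMBER_", text)
    return text


print(normalize_tweet("@bob 12 fires at 3.5km http://t.co/x1"))
# prints: _MENTION_ _NUMBER_ fires at _NUMBER_._NUMBER_km _URL_
```

Under these assumed rules a decimal such as 3.5 turns into two _NUMBER_ tokens around the dot, which is the behaviour I would like to confirm or correct.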

Jan 04 '21 15:01 herwinvw