GloVe
Twitter preprocessing script
I wanted to use the Twitter preprocessing script at https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb and found two bugs in it:
- URLs without http are not matched.
- The last gsub splits words containing capital letters where it should not and inserts the <ALLCAPS> token where it should not (illustrated in the snippet below).
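As far as I remember, the upstream script's last rule looks roughly like the one in the sketch below: any run of two or more "disallowed" characters (for example a space followed by a capital letter) is treated as an all-caps word, so ordinary capitalized words get broken apart:

```ruby
# Sketch of the second bug, assuming the upstream last rule is
#   .gsub(/([^a-z0-9()<>'`\-]){2,}/){ |word| "#{word.downcase} <ALLCAPS>" }
text = "So Happy today"
buggy = text.gsub(/([^a-z0-9()<>'`\-]){2,}/) { |word| "#{word.downcase} <ALLCAPS>" }
puts buggy  # the " H" run is matched, giving something like "So h <ALLCAPS>appy today"
```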
I think the script has not been tested and is probably not the one that was used to train the model, as discussed here: https://groups.google.com/forum/#!searchin/globalvectors/preprocessing|sort:date/globalvectors/_X7hQBBuoLY/2ysMo1sWCQAJ
This is my first time working with Ruby, but I have fixed those two bugs:
```ruby
def tokenize input
  # Different regex parts for smiley faces
  eyes = "[8:=;]"
  nose = "['`\-]?"

  input = input
    .gsub(/https?:\/\/\S+\b|www\.(\w+\.)+\S*/,"<URL>")
    .gsub(/www\.(\w+\.)+\S*/,"<URL>") # gombru: handle URLs without http
    .gsub("/"," / ") # Force splitting words appended with slashes (once we tokenized the URLs, of course)
    .gsub(/@\w+/, "<USER>")
    .gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")
    .gsub(/#{eyes}#{nose}p+/i, "<LOLFACE>")
    .gsub(/#{eyes}#{nose}\(+|\)+#{nose}#{eyes}/, "<SADFACE>")
    .gsub(/#{eyes}#{nose}[\/|l*]/, "<NEUTRALFACE>")
    .gsub(/<3/,"<HEART>")
    .gsub(/[-+]?[.\d]*[\d]+[:,.\d]*/, "<NUMBER>")
    .gsub(/#\S+/){ |hashtag| # Split hashtags on uppercase letters
      # TODO: also split hashtags with lowercase letters (requires more work to detect splits...)
      hashtag_body = hashtag[1..-1]
      if hashtag_body.upcase == hashtag_body
        result = "<HASHTAG> #{hashtag_body} <ALLCAPS>"
      else
        result = (["<HASHTAG>"] + hashtag_body.split(/(?=[A-Z])/)).join(" ")
      end
      result
    }
    .gsub(/([!?.]){2,}/){ # Mark punctuation repetitions (eg. "!!!" => "! <REPEAT>")
      "#{$~[1]} <REPEAT>"
    }
    .gsub(/\b(\S*?)(.)\2{2,}\b/){ # Mark elongated words (eg. "wayyyy" => "way <ELONG>")
      # TODO: determine if the end letter should be repeated once or twice (use lexicon/dict)
      $~[1] + $~[2] + " <ELONG>"
    }
    .gsub(/([^a-z0-9()<>'`\-]){1,}/){ |word|
      "#{word.downcase}" # gombru: fixed bug, downcasing everything
    }

  return input
end

puts tokenize($_)
```
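The script is meant to be run as a stream filter (`$_` holds the current input line when Ruby is invoked with the `-n` flag), e.g. `ruby -n preprocess-twitter.rb < tweets.txt`, where tweets.txt is any file with one tweet per line. As a quick sanity check, calling `tokenize` directly on a made-up tweet should give something like this:

```ruby
# Made-up example tweet; the exact output below is approximate, but with the
# fix everything (including the special tokens) comes out lowercased.
puts tokenize("I LOVE #MachineLearning!!! :) check www.example.com/page")
# => i love <hashtag> machine learning! <repeat> <smile> check <url>
```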
Hi, I know I'm arriving a bit late to this post, but in case anyone is interested in collecting new tweets and cleaning them, I can share this Python code. The cleaning code is in a Jupyter notebook because I wanted to show and test each step visually. Hope you find it useful :)
https://github.com/cyberosa/read_tweets_python