Twitter preprocessing script

Open gombru opened this issue 6 years ago • 1 comments

I wanted to use the Twitter preprocessing script at https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb and found a few bugs in it:

  1. URLs without `http` are not found.
  2. The last `gsub` splits words containing capitals where it should not, and adds the `<ALLCAPS>` token where it should not.
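
A quick way to check the first point is to run the URL pattern on its own. This is only a minimal sketch: the pattern is copied from the first substitution in the code posted further down, while the helper name `replace_urls` and the sample strings are made up for illustration.

```ruby
# Hypothetical helper: applies only the URL substitution, nothing else.
def replace_urls(text)
  # Pattern copied from the first gsub in the code posted in this issue.
  url_pattern = /https?:\/\/\S+\b|www\.(\w+\.)+\S*/
  text.gsub(url_pattern, "<URL>")
end

samples = [
  "read http://example.com/page now",  # scheme present
  "read www.example.com now",          # no scheme, but a www. prefix
  "read example.com now"               # bare domain: no scheme, no www.
]

samples.each { |sample| puts replace_urls(sample) }
```

With this pattern the first two samples become `read <URL> now`, while the bare `example.com` is left untouched.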

I think the script has not been tested and is probably not the one that was used to train the model, as discussed here: https://groups.google.com/forum/#!searchin/globalvectors/preprocessing|sort:date/globalvectors/_X7hQBBuoLY/2ysMo1sWCQAJ

This is my first time using Ruby, but I've fixed those two bugs:

```ruby
def tokenize input

# Different regex parts for smiley faces
eyes = "[8:=;]"
nose = "['`\-]?"

input = input
	.gsub(/https?:\/\/\S+\b|www\.(\w+\.)+\S*/,"<URL>")
	.gsub(/www\.(\w+\.)+\S*/,"<URL>") # gombru: handle URLS without http
	.gsub("/"," / ") # Force splitting words appended with slashes (once we tokenized the URLs, of course)
	.gsub(/@\w+/, "<USER>")
	.gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")
	.gsub(/#{eyes}#{nose}p+/i, "<LOLFACE>")
	.gsub(/#{eyes}#{nose}\(+|\)+#{nose}#{eyes}/, "<SADFACE>")
	.gsub(/#{eyes}#{nose}[\/|l*]/, "<NEUTRALFACE>")
	.gsub(/<3/,"<HEART>")
	.gsub(/[-+]?[.\d]*[\d]+[:,.\d]*/, "<NUMBER>")
	.gsub(/#\S+/){ |hashtag| # Split hashtags on uppercase letters
		# TODO: also split hashtags with lowercase letters (requires more work to detect splits...)

		hashtag_body = hashtag[1..-1]
		if hashtag_body.upcase == hashtag_body
			result = "<HASHTAG> #{hashtag_body} <ALLCAPS>"
		else
			result = (["<HASHTAG>"] + hashtag_body.split(/(?=[A-Z])/)).join(" ")
		end
		result
	}
	.gsub(/([!?.]){2,}/){ # Mark punctuation repetitions (eg. "!!!" => "! <REPEAT>")
		"#{$~[1]} <REPEAT>"
	}
	.gsub(/\b(\S*?)(.)\2{2,}\b/){ # Mark elongated words (eg. "wayyyy" => "way <ELONG>")
		# TODO: determine if the end letter should be repeated once or twice (use lexicon/dict)
		$~[1] + $~[2] + " <ELONG>"
	}
	.gsub(/([^a-z0-9()<>'`\-]){1,}/){ |word|
		"#{word.downcase}" # gombru: Fixed bug, Downcasing all
	}

return input

end

puts tokenize($_)
```
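
To see what the hashtag step and the final downcasing step do in isolation, they can be pulled out into small functions. This is only a sketch: the logic and patterns are copied from the code above, but the helper names `split_hashtag` and `downcase_rest` and the sample inputs are hypothetical.

```ruby
# Hashtag step from the code above, pulled out into a hypothetical helper.
def split_hashtag(hashtag)
  body = hashtag[1..-1]  # drop the leading "#"
  if body.upcase == body
    # No lowercase letters in the body: keep it whole and flag it.
    "<HASHTAG> #{body} <ALLCAPS>"
  else
    # Otherwise split in front of every uppercase letter.
    (["<HASHTAG>"] + body.split(/(?=[A-Z])/)).join(" ")
  end
end

# Final step from the code above: downcase every run of characters
# that are not already lowercase letters, digits, or ()<>'`-
def downcase_rest(text)
  text.gsub(/([^a-z0-9()<>'`\-]){1,}/) { |run| run.downcase }
end

puts split_hashtag("#MachineLearning")   # <HASHTAG> Machine Learning
puts split_hashtag("#NLP")               # <HASHTAG> NLP <ALLCAPS>
puts downcase_rest("Hello WORLD <URL>")  # hello world <url>
```

Note that the last step also lowercases the placeholder tokens themselves, so `<URL>` comes out as `<url>`.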

gombru, Jun 22 '18 10:06

Hi, I know I'm arriving a bit late to this post, but in case anyone is interested in collecting new tweets and cleaning them, I can share this Python code. The cleaning code is in a Jupyter notebook because I wanted to show and test each step visually. Hope you find it useful :)

https://github.com/cyberosa/read_tweets_python

cyberosa, Aug 20 '20 18:08