pyTweetCleaner
Python module to clean Twitter JSON data or tweet text by removing unnecessary data such as hyperlinks, replies to someone else's tweets, non-ASCII characters, and non-English tweets.
Usage 1 (clean a file of tweet JSON data):
>>> from pyTweetCleaner import TweetCleaner
>>> tc = TweetCleaner(remove_stop_words=True, remove_retweets=False)
>>> tc.clean_tweets(input_file='data/sample_input.json', output_file='data/sample_output.json')
Usage 2 (clean the text of a single tweet):
>>> from pyTweetCleaner import TweetCleaner
>>> sample_text = 'RT @testUser: Cleaning unnecessary data with pyTweetCleaner by @kevalMorabia97. #pyTWEETCleaner, Check it out at https:\/\/github.com\/kevalmorabia97\/pyTweetCleaner and star the repo! '
>>>
>>> tc = TweetCleaner(remove_stop_words=False, remove_retweets=False)
>>> print(tc.get_cleaned_text(sample_text))
RT @testUser: Cleaning unnecessary data with pyTweetCleaner by @kevalMorabia97 #pyTWEETCleaner Check it out at and star the repo
>>>
>>> tc = TweetCleaner(remove_stop_words=False, remove_retweets=True)
>>> print(tc.get_cleaned_text(sample_text))  # the sample is a retweet, so nothing is printed
>>>
>>> tc = TweetCleaner(remove_stop_words=True, remove_retweets=False, stopwords_file='user_stopwords.txt')
>>> print(tc.get_cleaned_text(sample_text))
RT @testUser: unnecessary data pyTweetCleaner @kevalMorabia97 #pyTWEETCleaner Check star repo
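The text-level effects visible in the outputs above (the hyperlink disappears, stop words are dropped) can be approximated with the standard library alone. The following is a minimal sketch, not pyTweetCleaner's actual implementation: the helper name `sketch_clean_text`, the URL pattern, and the example stop-word set are all assumptions made for illustration, and the sketch does not reproduce the module's punctuation handling.

```python
import re

# Hypothetical pattern: treats any whitespace-free run starting with
# "http:" or "https:" as a hyperlink (this also matches JSON-escaped URLs).
URL_PATTERN = re.compile(r'https?:\S+')


def sketch_clean_text(text, stop_words=frozenset()):
    """Illustrative helper (not part of pyTweetCleaner's API).

    Drops hyperlinks, non-ASCII characters, and stop words from text.
    """
    # 1. Remove hyperlinks.
    text = URL_PATTERN.sub('', text)
    # 2. Remove non-ASCII characters by round-tripping through ASCII.
    text = text.encode('ascii', errors='ignore').decode('ascii')
    # 3. Remove stop words (case-insensitive) and collapse whitespace.
    words = [w for w in text.split() if w.lower() not in stop_words]
    return ' '.join(words)


example = 'Check it out at https://example.com and star the repo caf\u00e9'
result = sketch_clean_text(example, stop_words={'it', 'out', 'at', 'and', 'the'})
print(result)
```

Splitting on whitespace keeps mentions and hashtags such as @kevalMorabia97 intact, which is consistent with the outputs shown above; a real stop-word list would typically come from nltk or from a file like the stopwords_file argument used earlier.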
Requirements:
- nltk>=3.2.4
Install with: pip install -r requirements.txt
Data Removed and Kept:
REMOVE:
- Tweets that have in_reply_to_status_id != null, i.e. comments on someone else's tweets
- Tweets that have lang != en, i.e. tweets not in English
- Data about deleted tweets
- Non-ASCII characters from text
- Hyperlinks from text
- Stop words from text (when remove_stop_words=True)
KEEP:
- created_at
- id
- text
- user_id
- user_name
- user_screen_name
- user_followers_count
- coordinates
- place
- retweet_count
- entities
- retweeted_status
Note: To keep only some of these data fields, comment out the others in the pyTweetCleaner.py file.
To keep additional fields, add them in pyTweetCleaner.py.
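The tweet-level rules listed above can be sketched as a filter followed by a field projection. This is an illustration under stated assumptions, not the code in pyTweetCleaner.py: the function names are hypothetical, deleted-tweet records are assumed to carry a top-level "delete" key, and the user_* fields are assumed to come from the nested "user" object of each tweet.

```python
import json

# Fields copied from the top level of the tweet (names from the KEEP list).
TOP_LEVEL_FIELDS = ['created_at', 'id', 'text', 'coordinates', 'place',
                    'retweet_count', 'entities', 'retweeted_status']

# Assumption: user_* output fields map to keys of the nested "user" object.
USER_FIELDS = {'user_id': 'id',
               'user_name': 'name',
               'user_screen_name': 'screen_name',
               'user_followers_count': 'followers_count'}


def sketch_should_drop(tweet):
    """Hypothetical filter mirroring the REMOVE rules for whole tweets."""
    if 'delete' in tweet:  # assumed shape of a deleted-tweet record
        return True
    if tweet.get('in_reply_to_status_id') is not None:  # reply to a tweet
        return True
    if tweet.get('lang') != 'en':  # non-English, or language missing
        return True
    return False


def sketch_project(tweet):
    """Hypothetical projection that keeps only the KEEP fields."""
    kept = {f: tweet[f] for f in TOP_LEVEL_FIELDS if f in tweet}
    user = tweet.get('user', {})
    for out_name, src_name in USER_FIELDS.items():
        if src_name in user:
            kept[out_name] = user[src_name]
    return kept


raw_lines = [
    '{"id": 1, "lang": "en", "text": "hello", "in_reply_to_status_id": null,'
    ' "user": {"id": 7, "screen_name": "a"}, "source": "web"}',
    '{"id": 2, "lang": "fr", "text": "bonjour", "in_reply_to_status_id": null}',
    '{"id": 3, "lang": "en", "text": "reply", "in_reply_to_status_id": 1}',
    '{"delete": {"status": {"id": 4}}}',
]
tweets = [json.loads(line) for line in raw_lines]
cleaned = [sketch_project(t) for t in tweets if not sketch_should_drop(t)]
print(cleaned)
```

In this sketch, adding or removing an entry in TOP_LEVEL_FIELDS or USER_FIELDS plays the same role as commenting fields in or out of pyTweetCleaner.py.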