# TweetScraper
TweetScraper is a simple crawler/spider for Twitter Search without using API
## Introduction
TweetScraper can get tweets from Twitter Search. It is built on Scrapy and does not use Twitter's APIs. The crawled data is not as clean as what the APIs return, but in exchange you are free of the APIs' rate limits and restrictions. Ideally, you can get all the data from Twitter Search.
WARNING: please be polite and follow the crawler's politeness policy.
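One way to stay polite is to throttle the crawler in `TweetScraper/settings.py`. The options below are standard Scrapy settings, not this project's defaults; the values are illustrative assumptions, so tune them to your needs:

```python
# TweetScraper/settings.py -- illustrative throttling options.
# These are standard Scrapy settings; the values are examples only.
DOWNLOAD_DELAY = 2           # wait at least 2 seconds between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server load
ROBOTSTXT_OBEY = True        # respect the site's robots.txt
```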
## Installation
- Install conda; you can get it from Miniconda. The tested Python version is 3.7.
- Install the Selenium Python bindings: https://selenium-python.readthedocs.io/installation.html. (Note: the `KeyError: 'driver'` error is caused by an incorrect setup.)
- For Ubuntu or Debian users, run:

  ```
  $ bash install.sh
  $ conda activate tweetscraper
  $ scrapy list
  $ # If the output is 'TweetScraper', then you are ready to go.
  ```

  The `install.sh` script creates a new conda environment `tweetscraper` and installs all the dependencies (e.g., `firefox-geckodriver` and `firefox`).
## Usage
- Change the `USER_AGENT` in `TweetScraper/settings.py` to identify who you are:

  ```
  USER_AGENT = 'your website/e-mail'
  ```

- In the root folder of this project, run a command like:

  ```
  scrapy crawl TweetScraper -a query="foo,#bar"
  ```

  where `query` is a list of keywords separated by commas and quoted by `"`. The query can be anything (a keyword, a hashtag, etc.) you want to search in Twitter Search. `TweetScraper` will crawl the search results of the query and save the tweet content and user information.

- By default, tweets are saved as JSON files under `./Data/tweet/` and user data under `./Data/user/`. Change `SAVE_TWEET_PATH` and `SAVE_USER_PATH` in `TweetScraper/settings.py` if you want another location.
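To work with the crawled data afterwards, you can read the saved JSON files back into Python. This is a minimal sketch, assuming one JSON document per file as described above; the field names inside each record depend on TweetScraper's item definitions, so inspect a record before relying on any particular key:

```python
import json
from pathlib import Path

def load_records(data_dir):
    """Load every JSON file the crawler wrote into data_dir.

    Assumes one JSON document per file; the keys inside each record
    are whatever TweetScraper saved, so check them before use.
    """
    records = []
    for path in sorted(Path(data_dir).glob("*")):
        if path.is_file():
            with open(path, encoding="utf-8") as f:
                records.append(json.load(f))
    return records

# Example (paths from the default settings):
# tweets = load_records("./Data/tweet/")
# users = load_records("./Data/user/")
# print(len(tweets), "tweets crawled")
```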
## Acknowledgement
Keeping the crawler up to date requires continuous effort; please support our work via opencollective.com/tweetscraper.
## License
TweetScraper is released under the GNU General Public License, Version 2.