twitterscraper
WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes
A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper:
- https://github.com/taspinar/twitterscraper/issues/301
- https://github.com/taspinar/twitterscraper/issues/299
- https://github.com/taspinar/twitterscraper/issues/298
- https://github.com/taspinar/twitterscraper/issues/296
To fix this, I re-implemented query.py using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance.
Additionally, I refactored query.py (now query_js.py) to be a bit cleaner.
Based on my testing, this branch can successfully download tweets from user pages, and via query strings.
How to run
Please test this change so I can fix any bugs!
- clone the repo, pull this branch
- install selenium dependencies (geckodriver and firefox) https://selenium-python.readthedocs.io/installation.html
- enter twitterscraper directory, python3 setup.py install
- run your query
If you have any bugs, please paste your command and full output in this thread!
Improvements
- Fix twitterscraper's failure due to Twitter retiring its legacy endpoints
- Multiple data points are now retrieved, not just tweets; this includes user metadata, location metadata, etc. All these data points are sent to the browser and returned by get_query_data (all tweets / metadata from a specific query) and get_user_data (all tweets / metadata on a user's page).
- Refactor query.py to be cleaner
- Previously, --user wouldn't get all of a user's tweets and retweets due to a limitation in Twitter's scrollback for a given user. Now a workaround enables retrieving tweets and retweets for a specific user via a custom search: f'filter:nativeretweets from:{from_user}'
- Fix https://github.com/taspinar/twitterscraper/issues/238: query_user_info broken
- Fix https://github.com/taspinar/twitterscraper/issues/278
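As a rough illustration of the custom search mentioned in the list above, the query string can be built with a small helper. This is only a sketch: the function name retweet_query and the stripping of a leading "@" are assumptions for illustration, not code from this branch; only the f-string format comes from the description.

```python
def retweet_query(from_user):
    """Build the search string for native retweets posted by one user.

    Hypothetical helper: only the f-string format below is taken from the
    PR description; the name and the "@" handling are illustrative.
    """
    # Accept "@name" as well as "name" so callers can pass either form.
    from_user = from_user.lstrip("@")
    return f'filter:nativeretweets from:{from_user}'


if __name__ == "__main__":
    print(retweet_query("@realDonaldTrump"))
```

The resulting string would then be passed to the scraper like any other query string.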
Notes
- pos was removed - now the browser is used to store pos state implicitly
- --javascript and -j now decide whether to use query.py or query_js.py
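To make the flag behaviour concrete, here is a minimal argparse sketch of a switch between the two modules. It is not the CLI code from this branch: the parser, the single positional query argument, and the backend_module helper are hypothetical, and the real command line has more options. Only the flag names and the two module names come from the notes above.

```python
import argparse


def build_parser():
    # Hypothetical, stripped-down parser; the real CLI has more options.
    parser = argparse.ArgumentParser(prog="twitterscraper")
    parser.add_argument("query", help="search query to scrape")
    parser.add_argument(
        "-j", "--javascript",
        action="store_true",
        help="scrape through a real browser (Selenium) instead of plain requests",
    )
    return parser


def backend_module(args):
    # Pick the module name based on the flag: query_js.py when -j is given,
    # query.py otherwise.
    return "query_js" if args.javascript else "query"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"would use {backend_module(args)}.py for {args.query!r}")
```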
Problems
- ~limit no longer works, though this should be relatively easy to fix if sufficiently desired~ (limit has now been implemented)
- query_user_info and query_user_page haven't been converted to use Selenium, so they don't work right now. However, this data is returned as part of the metadata mentioned in Improvements bullet 2.
- This change requires installing Selenium and geckodriver, which is more difficult than just pip install. However, use of Docker can alleviate this.
- Because this uses a real browser, it is slower (~1/2 as fast in my observations) and requires more memory.
- This changes the structure of the returned JSON object to match Twitter's response. On the plus side, it allows access to much more data than before.
Oh, that's amazing! Does multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.
@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy.
This uses FirefoxDriver, but I think ChromeDriver would work for this too.
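The "one driver per worker, one proxy per driver" idea can be sketched without Selenium. The example below is an illustration under stated assumptions, not the code in this branch: it uses threads instead of processes to stay short and easy to run, the proxy addresses are made up, and _scrape is a placeholder for the real scraping call. The point is the handoff: each worker runs an initializer once and claims a proxy that no other worker holds.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical proxy addresses, for illustration only.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

_local = threading.local()


def _init_worker(proxy_queue):
    # Runs once per worker: claim a proxy that no other worker holds.
    _local.proxy = proxy_queue.get_nowait()
    # A real worker would create its browser driver here, configured
    # to route traffic through _local.proxy.


def _scrape(query):
    # Placeholder for the real scraping call; it only reports which
    # worker and proxy handled the query.
    return threading.get_ident(), _local.proxy, query


def run_queries(queries, proxies=PROXIES):
    proxy_queue = queue.Queue()
    for proxy in proxies:
        proxy_queue.put(proxy)
    # One worker per proxy, so the queue can never run dry in _init_worker.
    with ThreadPoolExecutor(
        max_workers=len(proxies),
        initializer=_init_worker,
        initargs=(proxy_queue,),
    ) as pool:
        return list(pool.map(_scrape, queries))


if __name__ == "__main__":
    for worker, proxy, query in run_queries(["a", "b", "c", "d"]):
        print(worker, proxy, query)
```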
Beautiful, thanks!!
@lapp0, if you don't mind me asking, why was your previous usage of UserAgent dropped? I just did a quick run on it, and it seemed fine.
Thanks!
@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.
Thank you,
I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
Which file do I need to edit?
I got an error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
Problem solved. I forgot to get Firefox installed...😂
You need to install Geckodriver. If it's a mac, brew install geckodriver should suffice.
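Both errors reported above (geckodriver missing from PATH, Firefox not installed) can be caught before launching a scrape with a small preflight check. This is a sketch, not part of this branch: the helper name missing_browser_tools is hypothetical, and the executable name firefox is an assumption. Firefox can be installed without being on PATH (for example as a macOS application bundle), so a missing firefox entry is only a hint.

```python
import shutil


def missing_browser_tools(required=("geckodriver", "firefox"), which=shutil.which):
    """Return the names in `required` that cannot be found on PATH.

    Hypothetical helper; `which` is injectable so the check can be
    exercised without touching the real PATH.
    """
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    missing = missing_browser_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("geckodriver and firefox were both found on PATH")
```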
Oh oops, you're right! I just pushed those changes in misc fixes, reverted!
Fun side note: if you want to see the browser in action (or, if there's an issue, see what's going wrong), make the browser visible by setting driver.headless = False here https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R48
Make sure you limit the size of your pool to 1 though!
Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is proper (200) and I do manage to get data, but in query_single_page the array relevant_requests ends up always empty. For testing I'm running tweets = get_user_data('realDonaldTrump').
[edit] Specifically, it seems that isinstance(r.response.body, dict) is always false in query_single_page
@AllanSCosta I could not reproduce. I'm able to get 1300 of Trump's tweets.
Could you try again with latest changes, and set headless = False, and tell me if you see any errors on the twitter page itself? (Also add -j to your command)
As an aside, it appears that scrolling down on Twitter stops after 1300 tweets on realDonaldTrump's page. I'll investigate how to continue scrolling.
Edit: It appears the non-js query.py only gets 621 tweets, so this may just be a fundamental limitation in twitter.
https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.
I ran the code tweets = get_user_data('realDonaldTrump') and got 0 tweets.
I also tried tweets = get_query_data("BTS", poolsize = 1, lang = 'english') and got nothing as well.
@AllanSCosta @pumpkinw can you please
- add driver.save_screenshot("foo.png") to this line https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R126
- then share the resulting screenshot
- share your geckodriver version
- share your firefox version
- share your operating system and version
- share your selenium version
@lapp0
The screenshot correctly depicts Trump's twitter (as if I had manually opened the browser and accessed it). Here are the versions:
- geckodriver 0.26.0
- Firefox 77.0.1 (64-bit)
- OS and version: macOS Mojave 10.14.5
- Selenium 3.141.0
thanks @AllanSCosta
Are you using selenium-wire==1.1.2? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.
I'm using seleniumwire version 1.1.2 indeed :). To clarify, it is properly accessing the page. It's only the parsing of the request results that is failing, as of now. I'm happy to help restructure it for the latest version of seleniumwire if that's the direction you think is the way to go :)
Please try now, I have pegged selenium-wire to 1.0.1
It works now, thanks!! Was the only thing you changed the version of seleniumwire?
Thanks! It works for me now!
@AllanSCosta Yes, version >=1.0.2 of selenium-wire doesn't do conversion from gzip bytes -> python object.
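For anyone who wants to try a newer selenium-wire instead of the pinned version, the missing conversion could in principle be done by hand. The sketch below is an assumption-laden illustration, not what this branch does (the branch pins the version): body_to_object is a hypothetical helper, and it assumes the response body arrives either as an already-parsed object or as JSON bytes that may be gzip-compressed.

```python
import gzip
import json


def body_to_object(body):
    """Best-effort conversion of a response body to a Python object.

    Hypothetical helper: returns the parsed JSON, or None if the body
    is not JSON (for example an HTML page).
    """
    if isinstance(body, (dict, list)):
        return body  # already converted, nothing to do
    try:
        if isinstance(body, bytes):
            # Gzip streams start with the magic bytes 0x1f 0x8b.
            if body[:2] == b"\x1f\x8b":
                body = gzip.decompress(body)
            body = body.decode("utf-8")
        return json.loads(body)
    except (OSError, EOFError, TypeError, ValueError):
        return None


if __name__ == "__main__":
    raw = gzip.compress(json.dumps({"tweets": []}).encode("utf-8"))
    print(body_to_object(raw))
```

With a helper like this, a check such as isinstance(r.response.body, dict) would be applied to the converted value rather than to the raw bytes.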
Thank you so much, but I'm kinda lost? I'm new to this and I can't seem to pull your branch from my github desktop. I've installed gecko and selenium, but I didn't understand exactly what I have to do to run the query with your changes. Sorry if it's too much trouble!
thanks for testing @barabelha ! To run with my changes, you must add the --javascript argument.
To use my branch you must git remote add lapp0 https://github.com/lapp0/twitterscraper.git and git fetch lapp0 and git checkout lapp0/selenium
Are these changes in the master branch now? I would like to use this on my app with pip install. I know there was an issue with twitter scraping from June 1 (their old site was deprecated) so using selenium fixes that. Does the master branch now work?
@bamboozooled #304 is in origin/master which fixes the legacy issue for now. It isn't on pip though. For now you need to git clone and python3 setup.py install
This PR isn't in master either, it's still open.
@taspinar what are the procedures to get #304 (currently in origin/master) to pypi? Do we just need a new version tag on git and github automagically does the work? I think #304 is an important change to get to pypi since it fixes the program.
Thanks a lot @lapp0 !
Hi @lapp0 , I am new at using github so I wanted to know if you could give me more details of how to run "clone the repo, pull this branch" because I'm getting the same problems of getting 0 tweets when using the twitterscraper. Thank you!!
@Michelpayan
- Install git
- run the commands
git clone https://github.com/lapp0/twitterscraper.git
cd twitterscraper
git checkout origin/selenium
- then, still inside the twitterscraper directory, run python3 setup.py install along with the other install instructions in the post.
Let me know if you have any questions.