WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes

Open lapp0 opened this issue 5 years ago • 120 comments

A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper:

  • https://github.com/taspinar/twitterscraper/issues/301
  • https://github.com/taspinar/twitterscraper/issues/299
  • https://github.com/taspinar/twitterscraper/issues/298
  • https://github.com/taspinar/twitterscraper/issues/296

To fix this, I re-implemented query.py using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance.

Additionally, I refactored query.py (now query_js.py) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages, and via query strings.

How to run

Please test this change so I can fix any bugs!

  1. clone the repo, pull this branch
  2. install selenium dependencies (geckodriver and firefox) https://selenium-python.readthedocs.io/installation.html
  3. enter twitterscraper directory, python3 setup.py install
  4. run your query

If you have any bugs, please paste your command and full output in this thread!

Improvements

  • Fix twitterscraper's failure due to Twitter retiring legacy endpoints
  • Multiple data points are now retrieved, not just tweets. This includes user metadata, location metadata, etc. All these data points are sent to the browser and returned by get_query_data (all tweets / metadata from a specific query) and get_user_data (all tweets / metadata on a user's page).
  • Refactor query.py to be cleaner
  • Previously --user wouldn't get all of a user's tweets and retweets due to a limitation in Twitter's scrollback for a given user. Now a workaround enables retrieving tweets and retweets for a specific user via a custom search: f'filter:nativeretweets from:{from_user}'
  • fix https://github.com/taspinar/twitterscraper/issues/238 query_user_info broken
  • fix https://github.com/taspinar/twitterscraper/issues/278
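
The retweet workaround in the list above can be sketched as a small query builder. Only the f-string itself comes from this post; the names build_retweet_query and encode_for_url are hypothetical helpers for illustration, not functions from the branch.

```python
from urllib.parse import quote


def build_retweet_query(from_user: str) -> str:
    """Build the custom search string that lists a user's native retweets."""
    # Accept both "@name" and "name" (an assumption made for convenience).
    from_user = from_user.lstrip("@")
    return f"filter:nativeretweets from:{from_user}"


def encode_for_url(query: str) -> str:
    """Percent-encode a query so it can be used as a URL parameter value."""
    return quote(query, safe="")


if __name__ == "__main__":
    q = build_retweet_query("@example_user")
    print(q)
    print(encode_for_url(q))
```

How the branch actually passes this string to the browser is not shown here; the encoding step is only what a search URL would generally need.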

Notes

  • pos was removed - now the browser is used to store pos state implicitly
  • --javascript and -j now decide whether to use query.py or query_js.py

Problems

  • ~limit no longer works, though this should be relatively easy to fix if sufficiently desired~ (limit has now been implemented)
  • query_user_info and query_user_page haven't been converted to use selenium, so they don't work right now. However, this data is returned as part of the metadata mentioned in Improvements bullet 2
  • This change requires installing selenium and geckodriver, which is more difficult than just pip install. However, use of docker can alleviate this.
  • Because this uses a real browser, it will be slower (~1/2 as fast in my observations) and require more memory
  • This changes the structure of the returned json object to match Twitter's response. On the plus side, it allows access to much more data than before.

lapp0 avatar Jun 05 '20 03:06 lapp0

Oh, that's amazing! Does multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.

AllanSCosta avatar Jun 05 '20 03:06 AllanSCosta

@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy.

This uses FirefoxDriver, but I think ChromeDriver would work for this too.
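
As a rough sketch of the one-driver-per-process idea (not the code in this branch): the helper names below are hypothetical, the "host:port" proxy format is an assumption, and cycling through proxies when there are more workers than proxies is my own simplification.

```python
from itertools import cycle
from typing import Dict, List


def firefox_proxy_prefs(proxy: str) -> Dict[str, object]:
    """Translate "host:port" into Firefox manual-proxy preferences."""
    host, port = proxy.rsplit(":", 1)
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": host,
        "network.proxy.http_port": int(port),
        "network.proxy.ssl": host,
        "network.proxy.ssl_port": int(port),
    }


def assign_proxies(n_workers: int, proxies: List[str]) -> List[str]:
    """Give each worker one proxy, reusing proxies if workers outnumber them."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    return [next(pool) for _ in range(n_workers)]


def make_driver(proxy: str):
    """Create a headless Firefox driver that routes traffic through `proxy`."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    for name, value in firefox_proxy_prefs(proxy).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Each worker process would call make_driver once with its assigned proxy and keep that driver for all of its queries.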

lapp0 avatar Jun 05 '20 04:06 lapp0

Beautiful, thanks!!

@lapp0, if you don't mind me asking, why was your previous usage of UserAgent dropped? I just did a quick run on it, and it seemed fine.

Thanks!

AllanSCosta avatar Jun 05 '20 04:06 AllanSCosta

@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.

lapp0 avatar Jun 05 '20 04:06 lapp0

@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.

Thank you,

I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Which file do i need to edit?

hakanyusufoglu avatar Jun 05 '20 07:06 hakanyusufoglu

I got error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

yiw0104 avatar Jun 05 '20 07:06 yiw0104

I got error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

Problem solved. I forgot to get Firefox installed...😂

yiw0104 avatar Jun 05 '20 07:06 yiw0104

I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Which file do i need to edit?

You need to install Geckodriver. If it's a mac, brew install geckodriver should suffice.
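
A quick way to confirm the install before running a query is to check PATH from Python. This is a minimal sketch using only the standard library; find_executables is a hypothetical helper name. A missing firefox entry does not necessarily mean Firefox is absent, since geckodriver can also look in the default install locations.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_executables(
    names: Iterable[str] = ("geckodriver", "firefox"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_executables().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```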

AllanSCosta avatar Jun 05 '20 15:06 AllanSCosta

Oh oops, you're right! I just pushed those changes in misc fixes, reverted!

lapp0 avatar Jun 05 '20 15:06 lapp0

Fun side note: if you want to see the browsers in action (or, if there's an issue, see what's going wrong), allow the browser to be visible by setting driver.headless = False here https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R48

Make sure you limit the size of your pool to 1 though!

lapp0 avatar Jun 05 '20 15:06 lapp0

Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is proper (200) and I do manage to get data, but in query_single_page the array relevant_requests ends up always empty. For testing I'm running tweets = get_user_data('realDonaldTrump').

[edit] Specifically, it seems that isinstance(r.response.body, dict) is always false in query_single_page

AllanSCosta avatar Jun 05 '20 15:06 AllanSCosta

@AllanSCosta I could not reproduce. I'm able to get 1300 of Trump's tweets.

Could you try again with latest changes, and set headless = False, and tell me if you see any errors on the twitter page itself? (Also add -j to your command)

lapp0 avatar Jun 05 '20 17:06 lapp0

As an aside, it appears that scrolling down on twitter stops after 1300 tweets on realDonaldTrump's page. I'll investigate how to continue scrolling.

Edit: It appears the non-js query.py only gets 621 tweets, so this may just be a fundamental limitation in twitter.

lapp0 avatar Jun 05 '20 17:06 lapp0

https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.

lapp0 avatar Jun 05 '20 17:06 lapp0

I ran the code tweets = get_user_data('realDonaldTrump') and got 0 tweets. I also tried tweets = get_query_data("BTS", poolsize = 1, lang = 'english') and got nothing as well.

yiw0104 avatar Jun 05 '20 18:06 yiw0104

@AllanSCosta @pumpkinw can you please

  1. add driver.save_screenshot("foo.png") to this line https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R126
     • then share the resulting screenshot
  2. share your geckodriver version
  3. share your firefox version
  4. share your operating system and version
  5. share your selenium version
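
The version details requested above can be gathered in one go with a small script. This is a sketch under assumptions: tool_version, selenium_version and collect_report are hypothetical names, and it assumes geckodriver and firefox are on PATH and accept --version. Anything it cannot find is reported as None.

```python
import platform
import shutil
import subprocess
from typing import Dict, Optional


def tool_version(executable: str) -> Optional[str]:
    """Return the first line of `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=15
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


def selenium_version() -> Optional[str]:
    """Return the installed selenium version, or None if it is not importable."""
    try:
        import selenium
    except ImportError:
        return None
    return selenium.__version__


def collect_report() -> Dict[str, Optional[str]]:
    """Collect the environment details asked for in the list above."""
    return {
        "geckodriver": tool_version("geckodriver"),
        "firefox": tool_version("firefox"),
        "os": f"{platform.system()} {platform.release()}",
        "selenium": selenium_version(),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```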

lapp0 avatar Jun 05 '20 18:06 lapp0

@lapp0

The screenshot correctly depicts Trump's twitter (as if I had manually opened the browser and accessed it). Here are the versions:

  • geckodriver: 0.26.0
  • Firefox: 77.0.1 (64-bit)
  • OS and version: macOS Mojave 10.14.5
  • Selenium: 3.141.0

AllanSCosta avatar Jun 05 '20 19:06 AllanSCosta

thanks @AllanSCosta

Are you using selenium-wire==1.1.2? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.

lapp0 avatar Jun 05 '20 19:06 lapp0

I'm using seleniumwire version 1.1.2 indeed :). To clarify, it is properly accessing the page. It's only the parsing of the request results that is failing, as of now. I'm happy to help restructure it for the latest version of seleniumwire if that's the direction you think is the way to go :)

AllanSCosta avatar Jun 05 '20 19:06 AllanSCosta

Please try now, I have pegged selenium-wire to 1.0.1

lapp0 avatar Jun 05 '20 19:06 lapp0

It works now, thanks!! Was the only thing you changed the version of seleniumwire?

AllanSCosta avatar Jun 05 '20 19:06 AllanSCosta

Please try now, I have pegged selenium-wire to 1.0.1

Thanks! It works for me now!

yiw0104 avatar Jun 05 '20 20:06 yiw0104

@AllanSCosta Yes, version >=1.0.2 of selenium-wire doesn't do conversion from gzip bytes -> python object.
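
For anyone who wants to run a newer selenium-wire instead of the pinned version, the missing conversion could in principle be done by hand. The sketch below is not what this branch does (it pins 1.0.1); decode_body is a hypothetical helper, and it assumes the body is JSON that may arrive as gzip-compressed bytes.

```python
import gzip
import json
from typing import Any


def decode_body(body: Any) -> Any:
    """Return a response body as a Python object.

    Accepts an already-parsed dict/list, a JSON string, or JSON bytes
    that may be gzip-compressed.
    """
    if isinstance(body, (dict, list)):
        return body
    if isinstance(body, str):
        return json.loads(body)
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


if __name__ == "__main__":
    raw = gzip.compress(json.dumps({"tweets": []}).encode("utf-8"))
    print(decode_body(raw))
```

With something like this in place, a check such as isinstance(decode_body(r.response.body), dict) would no longer depend on the library doing the conversion.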

lapp0 avatar Jun 05 '20 20:06 lapp0

Thank you so much, but I'm kinda lost? I'm new to this and I can't seem to pull your branch from my github desktop. I've installed gecko and selenium, but I didn't understand exactly what I have to do to run the query with your changes. Sorry if it's too much trouble!

barabelha avatar Jun 06 '20 19:06 barabelha

thanks for testing @barabelha ! To run with my changes, you must add the --javascript argument.

To use my branch, run git remote add lapp0 https://github.com/lapp0/twitterscraper.git, then git fetch lapp0, then git checkout lapp0/selenium

lapp0 avatar Jun 06 '20 19:06 lapp0

Are these changes in the master branch now? I would like to use this on my app with pip install. I know there was an issue with twitter scraping from June 1 (their old site was deprecated) so using selenium fixes that. Does the master branch now work?

oluwatimio avatar Jun 07 '20 23:06 oluwatimio

@bamboozooled #304 is in origin/master which fixes the legacy issue for now. It isn't on pip though. For now you need to git clone and python3 setup.py install

This PR isn't in master either, it's still open.

@taspinar what are the procedures to get #304 (currently in origin/master) to pypi? Do we just need a new version tag on git and github automagically does the work? I think #304 is an important change to get to pypi since it fixes the program.

lapp0 avatar Jun 08 '20 00:06 lapp0

Thanks a lot @lapp0 !

oluwatimio avatar Jun 08 '20 00:06 oluwatimio

Hi @lapp0 , I am new at using github so I wanted to know if you could give me more details of how to run "clone the repo, pull this branch" because I'm getting the same problems of getting 0 tweets when using the twitterscraper. Thank you!!

Michelpayan avatar Jun 16 '20 12:06 Michelpayan

@Michelpayan

  1. Install git
  2. Run the commands
git clone https://github.com/lapp0/twitterscraper.git
cd twitterscraper
git checkout origin/selenium

Then, still inside the twitterscraper directory, run python3 setup.py install along with the other install instructions in the post.

Let me know if you have any questions.

lapp0 avatar Jun 16 '20 16:06 lapp0