A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper:

https://github.com/taspinar/twitterscraper/issues/301
https://github.com/taspinar/twitterscraper/issues/299
https://github.com/taspinar/twitterscraper/issues/298
https://github.com/taspinar/twitterscraper/issues/296

To fix this, I re-implemented query.py using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance.

Additionally, I refactored query.py (now query_js.py) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages, and via query strings.

How to run

Please test this change so I can fix any bugs!

Clone the repo and pull this branch
Install the Selenium dependencies (geckodriver and Firefox): https://selenium-python.readthedocs.io/installation.html
Enter the twitterscraper directory and run python3 setup.py install
Run your query

If you have any bugs, please paste your command and full output in this thread!

Improvements

Fix twitterscraper's failure due to Twitter retiring legacy endpoints
Multiple data points are now retrieved, not just tweets, including user metadata, location metadata, etc. All of these data points are sent to the browser and returned by get_query_data (all tweets / metadata from a specific query) and get_user_data (all tweets / metadata on a user's page).
Refactor query.py to be cleaner
Previously --user wouldn't get all of a user's tweets and retweets due to a limitation in Twitter's scrollback for a given user. Now a workaround enables retrieving tweets and retweets for a specific user via a custom search: f'filter:nativeretweets from:{from_user}'
fix https://github.com/taspinar/twitterscraper/issues/238 query_user_info broken
fix https://github.com/taspinar/twitterscraper/issues/278
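
A minimal sketch of how the retweet workaround in the list above could be expressed. Only the 'filter:nativeretweets from:{from_user}' string comes from this PR; the helper name build_user_queries and the plain from: query for the user's own tweets are assumptions made for illustration, not the code in query_js.py.

```python
def build_user_queries(from_user: str) -> list:
    """Return search strings that together cover a user's tweets and retweets.

    Hypothetical helper: the pairing of the two queries is an assumption.
    """
    user = from_user.lstrip("@")  # accept "@name" as well as "name"
    return [
        f"from:{user}",                        # the user's own tweets (assumed)
        f"filter:nativeretweets from:{user}",  # retweets, via the workaround search
    ]


for query in build_user_queries("@jack"):
    print(query)
```

Each string would then be run as an ordinary search query, which sidesteps the scrollback limit on the user's page.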

Notes

pos was removed - now the browser is used to store pos state implicitly
--javascript and -j now decide whether to use query.py or query_js.py
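
As a rough illustration of the flag described above, the sketch below shows one way a CLI could switch backends with argparse. The parser layout and the function names build_parser and select_backend are hypothetical; only the -j / --javascript flag and the two module names come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser; only -j/--javascript comes from the notes above."""
    parser = argparse.ArgumentParser(prog="twitterscraper")
    parser.add_argument("query", help="search string to scrape")
    parser.add_argument(
        "-j", "--javascript", action="store_true",
        help="use the Selenium backend (query_js.py) instead of query.py",
    )
    return parser


def select_backend(args: argparse.Namespace) -> str:
    """Return the name of the module that should handle the query."""
    return "query_js" if args.javascript else "query"


args = build_parser().parse_args(["BTS", "--javascript"])
print(select_backend(args))  # prints "query_js"
```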

Problems

~limit no longer works, though this should be relatively easy to fix if sufficiently desired~ (limit has now been implemented)
query_user_info and query_user_page haven't been converted to use Selenium, so they don't work right now. However, this data is returned as part of the metadata mentioned in Improvements bullet 2
This change requires installing Selenium and geckodriver, which is more difficult than just pip install. However, using Docker can alleviate this.
Because this uses a real browser, it will be slower (~1/2 as fast in my observations) and require more memory
This changes the structure of the returned JSON object to match Twitter's response. On the plus side, it allows access to much more data than before.

Jun 05 '20 03:06 lapp0

Oh, that's amazing! Does multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.

Jun 05 '20 03:06 AllanSCosta

@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy.
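
A sketch of the one-proxy-per-driver idea, written without importing Selenium so it runs anywhere. The function names and the "host:port" proxy format are assumptions; the network.proxy.* keys are standard Firefox preferences, which each worker could apply (for example through Selenium's Firefox Options.set_preference) before creating its driver. This is not the code from the PR.

```python
from itertools import cycle


def firefox_proxy_prefs(proxy: str) -> dict:
    """Translate a "host:port" string into Firefox proxy preferences."""
    host, port = proxy.rsplit(":", 1)
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": host,
        "network.proxy.http_port": int(port),
        "network.proxy.ssl": host,
        "network.proxy.ssl_port": int(port),
    }


def assign_proxies(proxies, poolsize):
    """Return one preference dict per worker in the pool.

    Proxies are reused round-robin when there are fewer proxies than workers.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    source = cycle(proxies)
    return [firefox_proxy_prefs(next(source)) for _ in range(poolsize)]


for prefs in assign_proxies(["10.0.0.1:8080", "10.0.0.2:3128"], poolsize=3):
    print(prefs["network.proxy.http"], prefs["network.proxy.http_port"])
```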

This uses FirefoxDriver, but I think ChromeDriver would work for this too.

Jun 05 '20 04:06 lapp0

Beautiful, thanks!!

@lapp0, if you don't mind me asking, why was your previous usage of UserAgent dropped? I just did a quick run on it, and it seemed fine.

Thanks!

Jun 05 '20 04:06 AllanSCosta

@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.

Jun 05 '20 04:06 lapp0

@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.

Thank you,

I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Which file do i need to edit?

Jun 05 '20 07:06 hakanyusufoglu

I got error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

Jun 05 '20 07:06 yiw0104

I got error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

Problem solved. I forgot to get Firefox installed...😂

Jun 05 '20 07:06 yiw0104

I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Which file do i need to edit?

You need to install Geckodriver. If it's a mac, brew install geckodriver should suffice.
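
To check whether the install worked before running a query, a small preflight script along these lines may help. It only looks for executables on PATH using the standard library; the function name is made up for this example, and Firefox is left out of the default because it is often installed as an application bundle rather than on PATH.

```python
import shutil


def missing_executables(names=("geckodriver",)):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("geckodriver is on PATH")
```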

Jun 05 '20 15:06 AllanSCosta

Oh oops, you're right! I just pushed those changes in misc fixes, reverted!

Jun 05 '20 15:06 lapp0

Fun side note: if you want to see the browsers in action (or, if there's an issue, see what's going wrong), allow the browser to be visible by setting driver.headless = False here https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R48

Make sure you limit the size of your pool to 1 though!

Jun 05 '20 15:06 lapp0

Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is proper (200) and I do manage to get data, but in query_single_page the array relevant_requests ends up always empty. For testing I'm running tweets = get_user_data('realDonaldTrump').

[edit] Specifically, it seems that isinstance(r.response.body, dict) is always false in query_single_page

Jun 05 '20 15:06 AllanSCosta

@AllanSCosta I could not reproduce. I'm able to get 1300 of Trump's tweets.

Could you try again with latest changes, and set headless = False, and tell me if you see any errors on the twitter page itself? (Also add -j to your command)

Jun 05 '20 17:06 lapp0

As an aside, it appears that scrolling down on twitter stops after 1300 tweets on realDonaldTrump's page. I'll investigate how to continue scrolling.

Edit: It appears the non-js query.py only gets 621 tweets, so this may just be a fundamental limitation in twitter.

Jun 05 '20 17:06 lapp0

https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.

Jun 05 '20 17:06 lapp0

I ran the code tweets = get_user_data('realDonaldTrump') and got 0 tweets. I also tried tweets = get_query_data("BTS", poolsize = 1, lang = 'english') and got nothing as well.

Jun 05 '20 18:06 yiw0104

@AllanSCosta @pumpkinw can you please

add driver.save_screenshot("foo.png") to this line https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R126

then share the resulting screenshot

share your geckodriver version
share your firefox version
share your operating system and version
share your selenium version

Jun 05 '20 18:06 lapp0

@lapp0

The screenshot correctly depicts Trump's twitter (as if I had manually opened the browser and accessed it). Here are the versions:

geckodriver 0.26.0
Firefox 77.0.1 (64-bit)
OS and version: macOS Mojave 10.14.5
Selenium 3.141.0

Jun 05 '20 19:06 AllanSCosta

thanks @AllanSCosta

Are you using selenium-wire==1.1.2? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.

Jun 05 '20 19:06 lapp0

I'm using seleniumwire version 1.1.2 indeed :). To clarify, it is properly accessing the page. It's only the parsing of the request results that is failing, as of now. I'm happy to help restructure it for the latest version of seleniumwire if that's the direction you think is the way to go :)

Jun 05 '20 19:06 AllanSCosta

Please try now, I have pegged selenium-wire to 1.0.1

Jun 05 '20 19:06 lapp0

It works now, thanks!! Was the only thing you changed the version of seleniumwire?

Jun 05 '20 19:06 AllanSCosta

Please try now, I have pegged selenium-wire to 1.0.1

Thanks! It works for me now!

Jun 05 '20 20:06 yiw0104

@AllanSCosta Yes, version >=1.0.2 of selenium-wire doesn't do conversion from gzip bytes -> python object.
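
For anyone who wants to work with a newer selenium-wire instead of pinning, the missing conversion could be done by hand roughly as follows. This is a sketch under the assumption that the body arrives as gzip-compressed JSON bytes, as described above; parse_response_body is a hypothetical helper, not part of selenium-wire or this PR.

```python
import gzip
import json


def parse_response_body(body):
    """Convert a response body into a Python object.

    Accepts an already-parsed dict/list, plain JSON text or bytes,
    or gzip-compressed JSON bytes.
    """
    if isinstance(body, (dict, list)):
        return body  # older behaviour: already converted
    if isinstance(body, str):
        body = body.encode("utf-8")
    if body[:2] == b"\x1f\x8b":  # gzip magic number
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


compressed = gzip.compress(json.dumps({"tweets": [1, 2, 3]}).encode("utf-8"))
print(parse_response_body(compressed))
```

With a helper like this, a check such as isinstance(r.response.body, dict) would be applied to the parsed result rather than to the raw bytes.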

Jun 05 '20 20:06 lapp0

Thank you so much, but I'm kinda lost? I'm new to this and I can't seem to pull your branch from my github desktop. I've installed gecko and selenium, but I didn't understand exactly what I have to do to run the query with your changes. Sorry if it's too much trouble!

Jun 06 '20 19:06 barabelha

thanks for testing @barabelha ! To run with my changes, you must add the --javascript argument.

To use my branch you must git remote add lapp0 https://github.com/lapp0/twitterscraper.git and git fetch lapp0 and git checkout lapp0/selenium

Jun 06 '20 19:06 lapp0

Are these changes in the master branch now? I would like to use this on my app with pip install. I know there was an issue with twitter scraping from June 1 (their old site was deprecated) so using selenium fixes that. Does the master branch now work?

Jun 07 '20 23:06 oluwatimio

@bamboozooled #304 is in origin/master which fixes the legacy issue for now. It isn't on pip though. For now you need to git clone and python3 setup.py install

This PR isn't in master either, it's still open.

@taspinar what are the procedures to get #304 (currently in origin/master) to pypi? Do we just need a new version tag on git and github automagically does the work? I think #304 is an important change to get to pypi since it fixes the program.

Jun 08 '20 00:06 lapp0

Thanks a lot @lapp0 !

Jun 08 '20 00:06 oluwatimio

Hi @lapp0 , I am new at using github so I wanted to know if you could give me more details of how to run "clone the repo, pull this branch" because I'm getting the same problems of getting 0 tweets when using the twitterscraper. Thank you!!

Jun 16 '20 12:06 Michelpayan

@Michelpayan

Install git
run the commands

git clone https://github.com/lapp0/twitterscraper.git
cd twitterscraper
git checkout origin/selenium

Then, still in the twitterscraper directory, run python3 setup.py install along with the other install instructions in the post.

Let me know if you have any questions.

Jun 16 '20 16:06 lapp0

WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes
