twitter-scraper-selenium
A Python package to scrape Twitter's front end easily with Selenium.
Table of Contents
- Getting Started
- Prerequisites
- Installation
- Installing from source
- Installing with PyPI
- Usage
- Available Functions in this Package - Summary
- Scraping profile's details
- In JSON Format - Example
- Function Argument
- Keys of the output
- Scraping profile's tweets
- In JSON format - Example
- In CSV format - Example
- Function Arguments
- Keys of the output data
- Scraping user's tweet using API
- In JSON format - Example
- Function Arguments
- Keys of the output
- Using scraper with proxy
- Unauthenticated Proxy
- Authenticated Proxy
- Privacy
- License
Prerequisites
Installation
Installing from the source
Download the source code or clone it with:
git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
Open terminal inside the downloaded folder:
python3 setup.py install
Installing with PyPI
pip3 install twitter-scraper-selenium
Usage
Available Functions in this Package - Summary
| Function Name | Function Description | Scraping Method | Scraping Speed |
| scrape_profile() | Scrapes the tweets of a Twitter user's profile | Browser Automation | Slow |
| get_profile_details() | Scrapes a Twitter user's profile details | HTTP Request | Fast |
| scrape_profile_with_api() | Scrapes tweets by Twitter profile username. It expects the username of the profile | Browser Automation & HTTP Request | Fast |
Note: The HTTP Request method sends requests directly to Twitter's API to scrape data, whereas Browser Automation visits the page and scrolls while collecting the data.
To scrape Twitter profile details:
from twitter_scraper_selenium import get_profile_details
twitter_username = "TwitterAPI"
filename = "twitter_api_data"
browser = "firefox"
headless = True
get_profile_details(twitter_username=twitter_username, filename=filename, browser=browser, headless=headless)
Output:
{
"id": 6253282,
"id_str": "6253282",
"name": "Twitter API",
"screen_name": "TwitterAPI",
"location": "San Francisco, CA",
"profile_location": null,
"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
"url": "https:\/\/t.co\/8IkCzCDr19",
"entities": {
"url": {
"urls": [{
"url": "https:\/\/t.co\/8IkCzCDr19",
"expanded_url": "https:\/\/developer.twitter.com",
"display_url": "developer.twitter.com",
"indices": [
0,
23
]
}]
},
"description": {
"urls": []
}
},
"protected": false,
"followers_count": 6133636,
"friends_count": 12,
"listed_count": 12936,
"created_at": "Wed May 23 06:01:13 +0000 2007",
"favourites_count": 31,
"utc_offset": null,
"time_zone": null,
"geo_enabled": null,
"verified": true,
"statuses_count": 3656,
"lang": null,
"contributors_enabled": null,
"is_translator": null,
"is_translation_enabled": null,
"profile_background_color": null,
"profile_background_image_url": null,
"profile_background_image_url_https": null,
"profile_background_tile": null,
"profile_image_url": null,
"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
"profile_banner_url": null,
"profile_link_color": null,
"profile_sidebar_border_color": null,
"profile_sidebar_fill_color": null,
"profile_text_color": null,
"profile_use_background_image": null,
"has_extended_profile": null,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null,
"translator_type": null
}
get_profile_details() arguments:
| Argument | Argument Type | Description |
| twitter_username | String | Twitter Username |
| output_filename | String | Name of the file where the output is stored. |
| output_dir | String | Directory where the output file is saved. |
| proxy | String | Optional. Pass it to scrape through a proxy. For an authenticated proxy, the format is username:password@host:port. |
Keys of the output: Detail of each key can be found here.
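Once the details are available, the fields can be read like any other JSON object. The following is a minimal sketch, assuming the output has the shape shown above; `summarize_profile` is a hypothetical helper that is not part of this package, and `sample_profile` is a trimmed copy of the example output rather than a live result.

```python
from datetime import datetime

# Trimmed copy of the example output above; a real run would load the saved file instead.
sample_profile = {
    "screen_name": "TwitterAPI",
    "followers_count": 6133636,
    "friends_count": 12,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
    "verified": True,
}


def summarize_profile(profile):
    """Hypothetical helper: pick a few fields and parse the creation date."""
    # created_at uses the form "Wed May 23 06:01:13 +0000 2007"
    created = datetime.strptime(profile["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "screen_name": profile["screen_name"],
        "followers": profile["followers_count"],
        "following": profile["friends_count"],
        "created_year": created.year,
        "verified": bool(profile.get("verified")),
    }


summary = summarize_profile(sample_profile)
print(summary)
```

Many keys in the example output are null, so `dict.get` is the safer way to read optional fields.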
To scrape profile's tweets:
In JSON format:
from twitter_scraper_selenium import scrape_profile
microsoft = scrape_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)
Output:
{
"1430938749840629773": {
"tweet_id": "1430938749840629773",
"username": "Microsoft",
"name": "Microsoft",
"profile_picture": "https://twitter.com/Microsoft/photo",
"replies": 29,
"retweets": 58,
"likes": 453,
"is_retweet": false,
"retweet_link": "",
"posted_time": "2021-08-26T17:02:38+00:00",
"content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
"link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
},...
}
In CSV format:
from twitter_scraper_selenium import scrape_profile
scrape_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")
Output:
...
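The saved CSV can be read back with the standard library. This is only a sketch: it assumes the CSV has a header row whose column names match the JSON keys (the CSV output is not shown here, so that is an assumption), and `read_tweets` is a hypothetical helper. An in-memory sample stands in for the saved file.

```python
import csv
import io

# Stand-in for the saved CSV file; the column names are ASSUMED to match the JSON keys.
sample_csv = io.StringIO(
    "tweet_id,username,likes,retweets\n"
    "1430938749840629773,Microsoft,453,58\n"
)


def read_tweets(handle):
    """Hypothetical helper: read rows and convert the counters from text to int."""
    rows = []
    for row in csv.DictReader(handle):
        # csv returns every value as a string, so numeric columns need converting
        row["likes"] = int(row["likes"])
        row["retweets"] = int(row["retweets"])
        rows.append(row)
    return rows


tweets = read_tweets(sample_csv)
print(tweets[0]["tweet_id"], tweets[0]["likes"])
```

For a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.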
scrape_profile() arguments:
| Argument | Argument Type | Description |
| twitter_username | String | Twitter username of the account |
| browser | String | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
| proxy | String | Optional. Pass it to scrape through a proxy. For an authenticated proxy, the format is username:password@host:port. |
| tweets_count | Integer | Number of tweets to scrape. Default is 10. |
| output_format | String | The output format, either JSON or CSV. Default is JSON. |
| filename | String | Name of the output file when output_format is set to CSV. If not passed, the filename is the same as the username passed. |
| directory | String | Directory for the output file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory. |
| headless | Boolean | Whether to run the crawler headlessly. Default is True. |
Keys of the output
| Key | Type | Description |
| tweet_id | String | Post identifier (integer cast to a string) |
| username | String | Username of the profile |
| name | String | Name of the profile |
| profile_picture | String | Profile Picture link |
| replies | Integer | Number of replies to the tweet |
| retweets | Integer | Number of retweets of the tweet |
| likes | Integer | Number of likes of the tweet |
| is_retweet | Boolean | Is the tweet a retweet? |
| retweet_link | String | The retweet link if the tweet is a retweet, otherwise an empty string |
| posted_time | String | Time when the tweet was posted, in ISO 8601 format |
| content | String | Content of the tweet as text |
| hashtags | Array | Hashtags present in the tweet, if any |
| mentions | Array | Mentions present in the tweet, if any |
| images | Array | Image links, if any are present in the tweet |
| videos | Array | Video links, if any are present in the tweet |
| tweet_url | String | URL of the tweet |
| link | String | Link to an external website, if one is present in the tweet |
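The keys above make simple post-processing straightforward. Here is a minimal sketch that drops retweets, parses `posted_time`, and sorts by likes. It assumes the JSON output is available as text in the shape shown earlier; `original_tweets_by_likes` is a hypothetical helper, and the second sample entry is made up purely for illustration.

```python
import json
from datetime import datetime

# Trimmed sample in the shape of the JSON output; the entry with id "1" is invented.
raw = json.dumps({
    "1430938749840629773": {
        "tweet_id": "1430938749840629773",
        "likes": 453,
        "is_retweet": False,
        "posted_time": "2021-08-26T17:02:38+00:00",
    },
    "1": {
        "tweet_id": "1",
        "likes": 10,
        "is_retweet": True,
        "posted_time": "2021-08-25T09:00:00+00:00",
    },
})


def original_tweets_by_likes(json_text):
    """Hypothetical helper: keep non-retweets, parse timestamps, sort by likes."""
    tweets = json.loads(json_text).values()
    originals = [t for t in tweets if not t["is_retweet"]]
    for tweet in originals:
        # posted_time is ISO 8601, which datetime.fromisoformat parses (Python 3.7+)
        tweet["posted_at"] = datetime.fromisoformat(tweet["posted_time"])
    return sorted(originals, key=lambda t: t["likes"], reverse=True)


top = original_tweets_by_likes(raw)
print(top[0]["tweet_id"], top[0]["posted_at"].isoformat())
```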
To scrape a profile's tweets with the API:
from twitter_scraper_selenium import scrape_profile_with_api
scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)
scrape_profile_with_api() Arguments:
| Argument | Argument Type | Description |
| username | String | Twitter's Profile username |
| tweets_count | Integer | Number of tweets to scrape. |
| output_filename | String | Name of the file where the output is stored. |
| output_dir | String | Directory where the output file is saved. |
| proxy | String | Optional. Pass it to scrape through a proxy. For an authenticated proxy, the format is username:password@host:port. |
| browser | String | Browser to use for extracting the GraphQL key. Default is Firefox. |
| headless | Boolean | Whether to run the browser in headless mode. |
Output:
{
"1608939190548598784": {
"tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
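Because the output is keyed by tweet ID, collecting the URLs takes a single pass over the values. This is a sketch under the assumption that the saved output has the shape shown above; `tweet_urls` is a hypothetical helper, and the nested details are left empty because they are elided in the example.

```python
# Sample in the shape of the output above; nested details are elided there, so left empty here.
sample_output = {
    "1608939190548598784": {
        "tweet_url": "https://twitter.com/elonmusk/status/1608939190548598784",
        "tweet_details": {},
        "user_details": {},
    },
}


def tweet_urls(output):
    """Hypothetical helper: collect the tweet_url of every entry."""
    return [entry["tweet_url"] for entry in output.values()]


urls = tweet_urls(sample_output)
print(urls)
```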
Using scraper with proxy (http proxy)
Pass the proxy argument to the function.
from twitter_scraper_selenium import scrape_profile
scrape_profile("elonmusk", headless=False, proxy="66.115.38.247:5678", output_format="csv", filename="musk")  # In IP:PORT format
Proxy that requires authentication:
from twitter_scraper_selenium import scrape_profile
microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
proxy="sajid:PASSWORD@HOST:5678")  # username:password@IP:PORT (PASSWORD and HOST are placeholders)
print(microsoft_data)
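Both proxy formats can be produced by one small function. The sketch below is illustrative only: `build_proxy` is a hypothetical helper that is not part of this package, and `proxy.example.com` with its credentials are placeholder values.

```python
def build_proxy(host, port, username=None, password=None):
    """Hypothetical helper: format a proxy string as host:port or username:password@host:port."""
    if (username is None) != (password is None):
        raise ValueError("username and password must be given together")
    address = f"{host}:{port}"
    if username is None:
        return address  # unauthenticated proxy
    return f"{username}:{password}@{address}"  # authenticated proxy


plain = build_proxy("66.115.38.247", 5678)
authenticated = build_proxy("proxy.example.com", 8080, "user", "secret")
print(plain)
print(authenticated)
```

The resulting string is what would be passed as the `proxy` argument.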
Privacy
This scraper only scrapes public data available to unauthenticated users and cannot scrape anything private.
LICENSE
MIT