playwright-webcrawler
Parallel crawler powered by Playwright-Python
Parallel crawler powered by headless browsers (Chromium, Firefox and WebKit)
Features
Crawlers that fetch HTML with plain HTTP requests are generally fast. However, they sometimes capture empty bodies, especially when a website requires JavaScript to render its content. Driving a real browser also makes the scraper behave more like a human visitor.
playwright-webcrawler is intended for crawling web pages in parallel using a headless browser driven by playwright-python.
| | Linux | macOS | Windows |
|---|---|---|---|
| Chromium 86.0.4238.0 | ✅ | ✅ | ✅ |
| WebKit 14.0 | ✅ | ✅ | ✅ |
| Firefox 80.0b8 | ✅ | ✅ | ✅ |
Headless execution is supported for all browsers on all platforms.
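As a rough illustration of what parallel crawling with a headless browser involves, the sketch below renders pages with Playwright and caps the number of pages in flight with a semaphore. It is not taken from this project's source: `fetch_rendered_html`, `crawl_all` and `max_concurrent` are hypothetical names, and the Playwright calls assume a recent release of Playwright for Python with the snake_case async API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def fetch_rendered_html(url: str, timeout_ms: int = 30000) -> str:
    """Load `url` in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so the rest of this sketch works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()


async def crawl_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]] = fetch_rendered_html,
    max_concurrent: int = 4,
) -> Dict[str, str]:
    """Fetch every URL, keeping at most `max_concurrent` fetches in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url: str) -> str:
        # The semaphore blocks here once `max_concurrent` fetches are running.
        async with semaphore:
            return await fetch(url)

    url_list = list(urls)
    pages = await asyncio.gather(*(bounded_fetch(u) for u in url_list))
    return dict(zip(url_list, pages))
```

Calling `asyncio.run(crawl_all(["https://example.com"]))` would return a mapping from each URL to its rendered HTML. Launching one browser per URL keeps the sketch short; a real crawler would more likely share a single browser instance across pages.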
Installation
playwright-webcrawler uses Python 3 (lowest version tested is 3.7.0).
Install requirements:
pip install -r requirements.txt
Note: playwright-webcrawler includes Playwright for Python. To download recent browser binaries for Chromium, Firefox and WebKit, run:
python -m playwright install
Configuration file
playwright-webcrawler stores all of its configuration options in a single configuration file, settings.py.
- ROBOTSTXT_OBEY
  If True, playwright-webcrawler will respect robots.txt policies.
- CONCURRENT_REQUESTS
  The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the crawler.
- PLAYWRIGHT_NAVIGATION_TIMEOUT = 30000
  The amount of time (in milliseconds) that the browser will wait for a page navigation before timing out.
- PLAYWRIGHT_BROWSER_TYPE
  Browser type (chromium, firefox, webkit) created when Playwright connects to a browser instance.
- PLAYWRIGHT_LAUNCH_OPTIONS
  Set of configurable options to set on the browser. See browserType.launch([options]) for a description of the fields.
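For orientation, a settings.py using the options above might look something like the following. Apart from the 30000 ms timeout shown above, every value here is an illustrative assumption rather than the project's actual default, and the exact types the crawler expects (for example, whether the browser type is given as a string) are likewise assumed.

```python
# settings.py (illustrative values only; adjust to suit your crawl)

# Respect the robots.txt policies of the crawled site.
ROBOTSTXT_OBEY = True

# Upper bound on simultaneous requests (8 is an arbitrary example value).
CONCURRENT_REQUESTS = 8

# Navigation timeout in milliseconds.
PLAYWRIGHT_NAVIGATION_TIMEOUT = 30000

# One of "chromium", "firefox" or "webkit" (assumed to be given as a string).
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Passed to browserType.launch(); "headless" is a standard Playwright launch option.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```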
Usage
- Once your configuration file is saved, simply launch your first crawl:
  python main.py <url>
- Wait until it has crawled the whole website, or exit using ^C.
How is this different from Playwright?
This crawler is built on top of Playwright for Python.
Playwright for Python provides low- to mid-level APIs to manipulate a headless browser, so you can build your own crawler with it. This way you have more control over which features to implement in order to satisfy your needs.
However, most crawlers require common features such as following links and obeying robots.txt.
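To give a sense of what these two features involve, here is a minimal standard-library sketch of link following and robots.txt checking. It does not reflect how playwright-webcrawler implements them; `same_site_links` and `is_allowed` are hypothetical helper names, and restricting links to the starting host is an assumed policy.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class _LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html: str, base_url: str) -> List[str]:
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts before comparing.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check `url` against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawl loop would typically call `same_site_links` on each rendered page and queue only those links for which `is_allowed` returns True.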
This crawler is a general solution for most crawling purposes. If you want to quickly start crawling with a headless browser, this crawler is for you.
Contributing
If you wish to contribute to this repository or report an issue, please do so via GitHub issues.