dude
dude (uncomplicated data extraction): A simple framework for writing web scrapers using Python decorators
- Playwright: https://playwright.dev/docs/auth#reuse-authentication-state
- Check whether this can also be done with the other backends
Modified Selenium that circumvents anti-bot detection. Repo: https://github.com/ultrafunkamsterdam/undetected-chromedriver

> Optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io Automatically downloads the...
1. Download by file extension
2. Download by mimetype, e.g. `png` should also match `image/png` mimetype

```console
dude scrape ... --download png,jpg  # download all png and jpg files
dude...
```
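The two matching rules above could be sketched roughly as follows. This is only an illustration, not Dude's implementation: the function name `should_download` and its parameters are hypothetical, and it assumes the standard-library `mimetypes` table is good enough to map an extension such as `png` to `image/png`.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Optional, Set
from urllib.parse import urlparse


def should_download(url: str, content_type: Optional[str], wanted: Set[str]) -> bool:
    """Hypothetical helper: decide whether a response matches `--download` values.

    `wanted` holds lowercase extensions without the dot, e.g. {"png", "jpg"}.
    """
    # 1. Match by file extension, taken from the path part of the URL.
    ext = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
    if ext in wanted:
        return True

    # 2. Match by mimetype: "png" should also match "image/png".
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content_type.split(";", 1)[0].strip().lower()
        for extension in wanted:
            guessed, _ = mimetypes.guess_type(f"file.{extension}")
            if guessed == mime:
                return True
    return False
```

Checking the extension first avoids needing a `Content-Type` header at all for obvious cases; the mimetype check covers URLs that carry no extension.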
[Autoscraper](https://github.com/alirezamika/autoscraper) is built for automatic web scraping. I believe it would be great to include it as well.
## Possible format:

```python
@select(sample="path/to/training/data")
def handler(result):
    return {"data": result}
```

## Potential backends:

- https://github.com/lorey/mlscraper
https://www.reddit.com/r/Python/comments/tc3x72/comment/i0xvy98/?utm_source=share&utm_medium=web2x&context=3
https://github.com/browserless/chrome#hosting-providers
- Set a value for Dude User-Agent instead of using the default values on each parser backend (e.g.: `pydude/{version} (+https://github.com/roniemartinez/dude)`)
- Add option to override the User-Agent
- For Playwright...
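
A minimal sketch of the first two points, assuming a single helper shared by all backends. The names `build_user_agent` and `build_headers` are hypothetical and not part of Dude's API; only the User-Agent format string comes from the proposal above.

```python
from typing import Dict, Optional

# Repository URL taken from the proposed User-Agent format.
DUDE_REPO_URL = "https://github.com/roniemartinez/dude"


def build_user_agent(version: str, override: Optional[str] = None) -> str:
    """Return the user-supplied override if given, else the default Dude User-Agent."""
    if override:
        return override
    return f"pydude/{version} (+{DUDE_REPO_URL})"


def build_headers(version: str, override: Optional[str] = None) -> Dict[str, str]:
    """Build request headers that any parser backend could reuse (hypothetical)."""
    return {"User-Agent": build_user_agent(version, override)}
```

Centralizing the value in one function means each backend sends the same header unless the user overrides it.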