shot-scraper
Auto scraper?
There's a neat little package, autoscraper, that lets you quickly build no-code web extractors.
- You take a page with known content.
- Say what text from it you need and what alias to bind it to. For example,
{ "name": " Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50", "end_of_bid": "Saturday, 11:32 PM" }
- Fit the model to your known page and known data.
- It then tries to find which DOM selectors yield the desired data with the best accuracy and saves them into a `model` object you can pickle. You probably should, given the known page may die long before the DOM changes, so it's best to keep model creation somewhere in a notebook.
- Now you can just predict that data from new URLs/DOMs.
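To make the fit/predict flow above concrete, here is a minimal stdlib-only sketch of the idea. It is not autoscraper's actual implementation or API: `text_paths`, `fit`, and `predict` are hypothetical names, the "selector" is just the chain of (tag, class) pairs leading to a text node, and the sample HTML is made up.

```python
import pickle
from html.parser import HTMLParser

# Tags that never get a closing tag; skipped so they don't pollute the path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _TextPaths(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed children).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.found


def fit(html, wanted):
    """Learn one path per alias from a known page and its known values."""
    rules = {}
    for alias, value in wanted.items():
        for path, text in text_paths(html):
            if text == value.strip():
                rules[alias] = path
                break
    return rules


def predict(rules, html):
    """Apply learned paths to a new page; the first match wins per alias."""
    result = {}
    for path, text in text_paths(html):
        for alias, rule in rules.items():
            if path == rule and alias not in result:
                result[alias] = text
    return result


# Made-up pages standing in for a known listing and a new one.
known_page = '<div class="item"><h1 class="title">Apple Mac Mini</h1><span class="bid">US $130.50</span></div>'
new_page = '<div class="item"><h1 class="title">Lenovo ThinkPad</h1><span class="bid">US $88.00</span></div>'

rules = fit(known_page, {"name": "Apple Mac Mini", "current_bid": "US $130.50"})
restored = pickle.loads(pickle.dumps(rules))  # the "model" survives pickling
print(predict(restored, new_page))
```

The learned rules are plain dicts of tuples, so they pickle without any extra work, which matches the "fit once in a notebook, reuse later" workflow.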
I actually wonder whether the idea can be extended to also use data from the heap to try to get the text out, especially given it's a lot messier than hunting for the selector.
This could be prototyped as another CLI on top of the heap, HTML, and image exporting here.
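A rough sketch of what the heap variant might look like, assuming the heap data has already been exported into plain nested dicts and lists (a strong simplification, since a real heap snapshot is far messier). `find_paths` and `follow` are hypothetical helpers, and the sample state objects are invented for illustration.

```python
def find_paths(obj, wanted, path=()):
    """Yield (alias, path) for every string leaf equal to a wanted value."""
    if isinstance(obj, dict):
        for key, child in obj.items():
            yield from find_paths(child, wanted, path + (key,))
    elif isinstance(obj, list):
        for index, child in enumerate(obj):
            yield from find_paths(child, wanted, path + (index,))
    elif isinstance(obj, str):
        for alias, value in wanted.items():
            if obj.strip() == value.strip():
                yield alias, path


def follow(obj, path):
    """Walk a learned path of dict keys and list indexes."""
    for step in path:
        obj = obj[step]
    return obj


# Invented stand-ins for data pulled out of the heap of a known and a new page.
known_state = {"listing": {"title": "Apple Mac Mini", "bids": [{"amount": "US $130.50"}]}}
new_state = {"listing": {"title": "Lenovo ThinkPad", "bids": [{"amount": "US $88.00"}]}}

wanted = {"name": "Apple Mac Mini", "current_bid": "US $130.50"}
learned = dict(find_paths(known_state, wanted))
print({alias: follow(new_state, path) for alias, path in learned.items()})
```

The same fit-then-predict split applies: paths are learned once from a page with known values, then followed on new data.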