shot-scraper Auto scraper?

Open walking-octopus opened this issue 1 year ago • 0 comments

There's a neat little package, autoscraper, that lets you quickly build no-code web extractors.

You take a page with known content.
Say what text you need from it and what alias to bind each piece to. For example, { "name": "Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50", "end_of_bid": "Saturday, 11:32 PM" }
Fit the model to your known page and known data.
It then tries to find which DOM selectors yield the desired data most accurately and saves them in a model object you can pickle. You probably should, since the known page may die long before the DOM changes, so it's best to keep model creation somewhere in a notebook.
Now you can just predict that data from new URLs/DOMs.
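
To make the fit/predict workflow above concrete, here is a minimal standard-library sketch. It is not autoscraper's actual algorithm or API: the `fit`, `predict`, and `text_paths` helpers are hypothetical names, a "selector" is simplified to the chain of (tag, class) pairs leading to a text node, and both HTML snippets are made-up stand-ins for real pages.

```python
import pickle
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _TextPaths(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.found


def fit(html, wanted):
    """For each alias, remember the path of the node holding the known text."""
    model = {}
    for alias, value in wanted.items():
        for path, text in text_paths(html):
            if text == value.strip():
                model[alias] = path
                break
    return model


def predict(model, html):
    """Apply the learned paths to a new document (first match per alias)."""
    result = {}
    for path, text in text_paths(html):
        for alias, learned in model.items():
            if path == learned and alias not in result:
                result[alias] = text
    return result


# Known page and known data (the markup is an invented example).
known_page = """
<div class="item"><h1 class="title">Apple Mac Mini (256GB SSD, M1, 8GB)</h1>
<span class="bid">US $130.50</span><span class="ends">Saturday, 11:32 PM</span></div>
"""
wanted = {
    "name": "Apple Mac Mini (256GB SSD, M1, 8GB)",
    "current_bid": "US $130.50",
    "end_of_bid": "Saturday, 11:32 PM",
}
model = fit(known_page, wanted)
restored = pickle.loads(pickle.dumps(model))  # the model survives pickling

# A new page with the same layout (hypothetical listing, for illustration only).
new_page = """
<div class="item"><h1 class="title">ThinkPad X220</h1>
<span class="bid">US $75.00</span><span class="ends">Sunday, 9:05 AM</span></div>
"""
print(predict(restored, new_page))
```

A real tool would need fuzzier matching and would score several candidate selectors against each other; exact path equality is just the simplest thing that shows the idea.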

I actually wonder whether the idea can be extended to also use data from the heap to try to get the text out, especially given that doing so is a lot messier than hunting for the selector.

This could be prototyped as another CLI on top of the heap, HTML, and image exporting here.

Dec 14 '23 19:12 walking-octopus