shot-scraper

Auto scraper?

Open walking-octopus opened this issue 1 year ago • 0 comments

There's a neat little package, autoscraper, that lets you quickly build no-code web extractors.

  • You take a page with known content.
  • Say what text you need from it and what alias to bind each value to. For example, { "name": "Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50", "end_of_bid": "Saturday, 11:32 PM" }
  • Fit the model to your known page and known data.
  • It then looks for the DOM selectors that yield the desired data most accurately and saves them into a model object you can pickle. You probably should, since the known page may die long before the DOM changes, so it is best to keep model creation somewhere in a notebook.
  • Now you can just predict that data from new URLs/DOMs.
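
To make the fit/predict idea concrete, here is a toy sketch using only the Python standard library. It is not autoscraper's actual algorithm or API: the names `fit`, `predict`, and `text_paths` are hypothetical, the "selector" is just the chain of (tag, class) pairs leading to a text node, and the second page's contents are made up for illustration.

```python
import pickle
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class _TextPaths(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void tags never get a closing tag
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.found


def fit(html, wanted):
    """Learn, for each alias, the path of the first node whose text matches."""
    rules = {}
    for path, text in text_paths(html):
        for alias, sample in wanted.items():
            if text == sample.strip() and alias not in rules:
                rules[alias] = path
    return rules


def predict(html, rules):
    """Apply learned paths to a new page; first match per alias wins."""
    result = {}
    for path, text in text_paths(html):
        for alias, rule in rules.items():
            if path == rule and alias not in result:
                result[alias] = text
    return result


known_page = (
    '<div class="item"><h1 class="title">Apple Mac Mini (256GB SSD, M1, 8GB)</h1>'
    '<span class="bid">US $130.50</span></div>'
)
# Made-up second listing with the same layout.
new_page = (
    '<div class="item"><h1 class="title">ThinkPad X220</h1>'
    '<span class="bid">US $45.00</span></div>'
)

rules = fit(known_page, {"name": "Apple Mac Mini (256GB SSD, M1, 8GB)",
                         "current_bid": "US $130.50"})
rules = pickle.loads(pickle.dumps(rules))  # the "model" survives pickling
print(predict(new_page, rules))
# → {'name': 'ThinkPad X220', 'current_bid': 'US $45.00'}
```

Exact path matching is the crudest possible choice and breaks as soon as the layout shifts; a real tool would presumably score several candidate selectors and tolerate fuzzy matches.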

I wonder whether the idea can be extended to also use data from the heap to try to get the text out, especially given that digging through the heap is a lot messier than hunting for a selector.

This could be prototyped as another CLI on top of the heap, HTML, and image exporting here.

walking-octopus, Dec 14 '23 19:12