
Scraping n levels deep

Open rawmean opened this issue 9 months ago • 4 comments

Is your feature request related to a problem? Please describe.
I'd like to scrape a website n levels deep.

Describe the solution you'd like
For example, given url = example.com, the scraper should also follow the links on example.com and scrape those pages too.

Describe alternatives you've considered
I can use BeautifulSoup to download the pages myself and then feed them to this library.
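
For reference, a rough sketch of that workaround, using only requests and BeautifulSoup (the depth limit and same-domain filter are my own additions to keep the crawl bounded):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(url, depth, seen=None):
    """Return {url: html} for `url` and every same-site link within `depth` hops."""
    seen = set() if seen is None else seen
    if depth < 0 or url in seen:
        return {}
    seen.add(url)
    html = requests.get(url, timeout=10).text
    pages = {url: html}
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        # Same-domain check is my assumption, just to keep the crawl bounded.
        if urlparse(link).netloc == urlparse(url).netloc:
            pages.update(crawl(link, depth - 1, seen))
    return pages


pages = crawl("https://example.com", depth=1)  # depth=1 -> page plus its direct links
```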

rawmean commented on Apr 30, 2024

Hey @rawmean, we will add it to the to-do list of feature requests! It would be interesting to create a new graph for this, maybe calling it CrawlerGraph or DeepScraperGraph
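
Hypothetically, usage could mirror the existing SmartScraperGraph, with an extra depth option in the config (everything below is imagined, nothing is implemented yet):

```python
# Purely hypothetical sketch -- DeepScraperGraph does not exist yet; I'm assuming
# it would take the same (prompt, source, config) shape as SmartScraperGraph.
from scrapegraphai.graphs import DeepScraperGraph  # hypothetical class

graph = DeepScraperGraph(
    prompt="List all products with their prices",
    source="https://example.com",
    config={
        "llm": {"model": "gpt-4o-mini"},  # any configured model
        "depth": 2,                       # hypothetical: follow links 2 levels deep
    },
)
result = graph.run()
```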

PeriniM commented on Apr 30, 2024

I'll try to take a stab at it. This is what I'm thinking:

Input: URL

  1. FetchNode
  2. ParseNode
  3. RAGNode
  4. SearchLinkNode -> Get all the links on the page
  5. (new) LinkFilterNode -> Filter the links, keeping only the potentially relevant ones
  6. (new) RepeaterNode -> Executes the graph from the child node onwards once for each input link, in parallel (see the sketch after this list)
  7. FetchNode
  8. ParseNode
  9. RAGNode
  10. (new) ContainsAnswerNode -> A new node type that can tell if the current content contains the answer
  11. (new) ConditionalNode -> A new node with two children: if the parent returns true, pick child 1, else pick child 2
  12a. GenerateAnswerNode
  12b. Go to step 4 for the next level of depth

Let me know if this looks reasonable, or if you have another plan or a better alternative in mind.
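
To make the two new control-flow pieces concrete, here's a rough sketch of RepeaterNode and ConditionalNode (the state-dict interface and all names here are placeholders of mine, not the library's actual node API):

```python
from concurrent.futures import ThreadPoolExecutor


class RepeaterNode:
    """Step 6: run a sub-graph once per input link, in parallel (sketch)."""

    def __init__(self, subgraph):
        self.subgraph = subgraph  # the FetchNode -> ... -> ConditionalNode chain

    def execute(self, state):
        links = state.get("relevant_links", [])  # output of LinkFilterNode (step 5)
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(
                lambda link: self.subgraph.run({**state, "url": link}),
                links,
            ))
        return {**state, "child_results": results}


class ConditionalNode:
    """Step 11: route to one of two children based on a boolean in the state."""

    def __init__(self, condition_key, true_child, false_child):
        self.condition_key = condition_key  # e.g. set by ContainsAnswerNode (step 10)
        self.true_child = true_child        # 12a: GenerateAnswerNode
        self.false_child = false_child      # 12b: back to SearchLinkNode for the next level

    def execute(self, state):
        child = self.true_child if state.get(self.condition_key) else self.false_child
        return child.execute(state)
```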

mayurdb commented on May 7, 2024

Yeah, please contact me through email ([email protected])

VinciGit00 commented on May 7, 2024

Sounds really interesting.

ChrisDelClea commented on May 9, 2024

I am looking for this feature too. There are two use cases:

1. Loop through several path levels of a website to extract information from all item pages, e.g. all shop item information, or all rental house prices and locations. In this case, I can specify which paths get processed with regular expressions (see the sketch below).
2. Loop through all pages of a small website. It behaves like a crawler such as Nutch, while I can specify what to get from each page: one prompt to match the target page, and one prompt to get the data/files from that page. Sometimes I need to crawl all videos/images matching a specified condition across the website.
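
For use case 1, the path filtering I have in mind would look something like this (the pattern and URLs are made-up examples):

```python
import re
from urllib.parse import urlparse

# Made-up pattern: keep only item detail pages like /shop/items/123
ITEM_PATH = re.compile(r"^/shop/items/\d+$")


def is_item_page(url):
    return bool(ITEM_PATH.match(urlparse(url).path))


links = [
    "https://example.com/shop/items/42",  # matches -> extract with the item prompt
    "https://example.com/shop/cart",      # skipped
]
item_links = [u for u in links if is_item_page(u)]
```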

davideuler commented on Sep 26, 2024