Scrapegraph-ai
Scraping n levels deep
Is your feature request related to a problem? Please describe. I'd like to scrape a website n levels deep.
Describe the solution you'd like For example, given url = example.com, the scraper should also follow the links on example.com and scrape those pages too.
Describe alternatives you've considered I could use BeautifulSoup to download the linked pages and then feed them to this library.
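A minimal sketch of that manual workaround, assuming requests and BeautifulSoup are available (the URL and the fetch_with_links helper are illustrative, not part of this library):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_with_links(url: str) -> tuple[str, list[str]]:
    """Download a page and return its HTML plus the absolute URLs it links to."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return html, links

html, links = fetch_with_links("https://example.com")
# Each downloaded page (or followed link) could then be fed to the scraper individually.
```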
Hi @rawmean, we will add it to the to-do list for feature requests! It would be interesting to create a new graph for this, maybe calling it CrawlerGraph or DeepScraperGraph.
I'll try to take a stab at it. This is what I'm thinking, with a URL as input (a rough control-flow sketch follows the list):

1. FetchNode
2. ParseNode
3. RAGNode
4. SearchLinkNode -> get all the links on the page
5. (new) LinkFilterNode -> keep only the potentially relevant links
6. (new) RepeaterNode -> executes the graph from the child node onwards once for each input link, in parallel
7. FetchNode
8. ParseNode
9. RAGNode
10. (new) ContainsAnswerNode -> a new node type that can tell whether the current content contains the answer
11. (new) ConditionalNode -> a new node with two children; if the parent returns true, pick child 1, else pick child 2
    - 12a. GenerateAnswerNode
    - 12b. Go to step 4 for the next level of depth
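Here is a runnable sketch of that control flow, with requests/BeautifulSoup standing in for the fetch/parse nodes. It is not the actual Scrapegraph-ai API: contains_answer() is a naive keyword placeholder for the LLM-backed ContainsAnswerNode, the LinkFilterNode step is omitted, and the repeat loop runs serially rather than in parallel.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def contains_answer(text: str, prompt: str) -> bool:
    # Placeholder for the LLM-backed ContainsAnswerNode: a naive keyword check.
    return prompt.lower() in text.lower()

def deep_scrape(url: str, prompt: str, max_depth: int) -> str | None:
    html = requests.get(url, timeout=10).text           # FetchNode
    soup = BeautifulSoup(html, "html.parser")           # ParseNode
    text = soup.get_text(" ", strip=True)
    if contains_answer(text, prompt):                   # ContainsAnswerNode
        return text                                     # 12a: GenerateAnswerNode would run here
    if max_depth == 0:                                  # ConditionalNode: stop descending
        return None
    # SearchLinkNode; the LinkFilterNode step is omitted here for brevity.
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    for link in links:                                  # RepeaterNode (serial, not parallel)
        answer = deep_scrape(link, prompt, max_depth - 1)   # 12b: go one level deeper
        if answer is not None:
            return answer
    return None
```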
Let me know if this looks reasonable, or if you have another plan or a better alternative in mind.
Yeah, please contact me through email ([email protected]).
Sounds really interesting.
I am looking for this feature too. There are two use cases (a link-filtering sketch follows):

1. Loop through several path levels of a website to extract information from all item pages, e.g. all shop item information, or the prices and locations of all rental houses. In this case, I can specify which paths should be processed via regular expressions.
2. Loop through all pages of a small website. This behaves like a crawler such as Nutch, except that I can specify what to extract from each page: one prompt to match the target pages, and another prompt to get the data/files from each matched page. Sometimes I need to crawl all videos/images matching a specified condition from the website.
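A hypothetical illustration of use case 1: keep only the links whose path matches a user-supplied regular expression. The ITEM_PATTERN shape is an assumption and would be per-site configuration.

```python
import re

# Assumed URL shape for "item" detail pages; adjust per site.
ITEM_PATTERN = re.compile(r"/items/\d+$")

def filter_links(links: list[str], pattern: re.Pattern) -> list[str]:
    """Keep only the links whose path matches the user-supplied regex."""
    return [link for link in links if pattern.search(link)]

links = [
    "https://shop.example.com/items/42",
    "https://shop.example.com/about",
]
print(filter_links(links, ITEM_PATTERN))  # ['https://shop.example.com/items/42']
```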