[INS-2214] [Feature] Web Crawler Operator

Open • praharshjain opened this issue 9 months ago • 5 comments

Is There an Existing Issue for This?

  • [X] I have searched the existing issues

Project

Instill VDP

Is your Proposal Related to a Problem?

No, it is a new feature request.

Describe Your Proposed Solution

We can implement a "Web Crawler" operator that takes an initial URL and a depth (int) as input, recursively extracts links from the pages it reaches up to the given depth, and returns the extracted URLs as a list of strings.
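
To make the proposal concrete, here is a rough Python sketch of the crawl logic, not a proposed implementation for VDP. Everything named in it (`crawl`, `extract_links`, `fetch_html`, the `fetch` parameter) is hypothetical. It assumes that depth 1 means "extract links from the initial page only", that the initial URL itself is not part of the result, and that only `http`/`https` links are kept. It walks the pages breadth-first with a queue instead of literal recursion, which gives the same depth limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in html, with fragments removed."""
    parser = _LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch_html(url):
    """Default fetcher (assumption: pages are UTF-8 HTML served over HTTP)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, depth, fetch=fetch_html):
    """Return URLs discovered from start_url, following links up to depth levels."""
    seen = {start_url}
    found = []
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # this page is at the depth limit, so do not expand it
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                found.append(link)
                queue.append((link, level + 1))
    return found
```

The `fetch` parameter is only there so the crawl can be tried against canned pages without network access. A real operator would probably also need things this sketch leaves out, such as a cap on the number of pages, same-domain filtering, and robots.txt handling.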

Highlight the Benefits

Such an operator will be useful for crawling and gathering online data. For example, the links captured by it can then be fed to the text extraction operator to build a knowledge base from linked documents.

Anything Else?

No response

praharshjain • Sep 29 '23 19:09