
[FEAT] Website Scraping with depth level

Mirgiacomo opened this issue 1 year ago • 10 comments

I saw that there is a possibility to "embed" a website from a URL. However, this embedding is done only with depth 0, that is, it only scrapes the page given in the URL. This means that if I have a tree-structured site with depth 3, it is not possible to scrape and then embed recursively, taking all the pages. Is this a bug or is it intended? Are there any plans to develop this functionality? Or has something already been done? I don't know if it is feasible, but I might as well implement this feature myself and then merge it.

Mirgiacomo avatar Dec 31 '23 10:12 Mirgiacomo
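
For illustration, the depth-limited crawl being requested here is essentially a breadth-first traversal that stops expanding links past a configurable depth. Below is a minimal, stdlib-only Python sketch of that idea; it is not AnythingLLM's actual scraper, and every name in it is illustrative:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth=3):
    """Breadth-first crawl, following same-host links up to max_depth."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}  # url -> raw HTML, ready for chunking/embedding

    while queue:
        url, depth = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        pages[url] = html

        if depth == max_depth:
            continue  # don't expand links past the requested depth
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

With `max_depth=0` this reduces to the current behavior of fetching only the submitted page, which matches what the issue describes.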

Depth is not implemented currently because of how we originally implemented web scraping. With recent developments it is now possible to add depth to the scraper, but the UI doesn't have the space for it.

In the future, all non-document scrapers will be moved to Data Connectors, where you can control the scraping params, like depth, more granularly. The workspace UI used during scraping simply doesn't have the space for all those controls, so design work needs to be done on how we will bridge the two functionalities.

timothycarambat avatar Dec 31 '23 21:12 timothycarambat
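
To make the "granular params" idea concrete, a website Data Connector could carry crawl settings alongside the URL. The shape below is purely hypothetical and not AnythingLLM's actual schema:

```python
from dataclasses import dataclass


@dataclass
class WebsiteConnectorParams:
    """Hypothetical knobs a website Data Connector could expose."""

    url: str
    max_depth: int = 0            # 0 keeps today's single-page behavior
    max_pages: int = 100          # hard cap so a deep site can't run away
    same_domain_only: bool = True
    include_patterns: tuple = ()  # e.g. ("/docs/",)
    exclude_patterns: tuple = ()  # e.g. ("/blog/",)
```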

How do we even upload URLs (be it at depth 0)?

bokey007 avatar Jan 11 '24 04:01 bokey007

@bokey007 [screenshot: Screen Shot 2024-01-10 at 9.43.59 PM]

timothycarambat avatar Jan 11 '24 05:01 timothycarambat

Thank you @timothycarambat

Can we not upload more than one URL/website?

bokey007 avatar Jan 11 '24 12:01 bokey007

@bokey007 you can use that input to submit each website one at a time. We don't have a bulk website scraper in the tool to scrape many websites at once.

timothycarambat avatar Jan 11 '24 19:01 timothycarambat
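
As a stopgap, one-at-a-time submission can be scripted against the instance's developer API. The route and payload below are assumptions for illustration only; check your AnythingLLM instance's API docs for the real endpoint and auth scheme before relying on this:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint/payload -- consult your AnythingLLM instance's
# developer API docs for the actual route and authentication.
API_BASE = "http://localhost:3001/api/v1"
API_KEY = "YOUR-API-KEY"

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

for url in urls:
    req = Request(
        f"{API_BASE}/document/upload-link",  # assumed route
        data=json.dumps({"link": url}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urlopen(req) as resp:
        print(url, resp.status)
```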

Got it, thanks!

bokey007 avatar Jan 12 '24 03:01 bokey007

Yes... we need this feature.

pietrondo avatar Feb 28 '24 12:02 pietrondo

Just to add to this: if I wanted some help with Gradio's documentation, for example, then going to the link on their site would send me to https://www.gradio.app/docs/interface. If I add that link, then I get that single page. However, there are dozens of pages, each with its own URL, and they would have to be entered one at a time. In the words of the immortal meme: 'ain't nobody got time for that!'

I know that there are potential issues around HTML vs. JS sites etc., but the inability to have it do some kind of crawl on a documentation site is unbelievably limiting. Whatever solution you're thinking of, please (pretty please?) implement it soon. Without it the app's usefulness is very limited for my circumstances, and I imagine for many others like me. I will beta test it 'til kingdom come if you give me the chance... nudge nudge ;-)

Captain-Bacon avatar Feb 29 '24 22:02 Captain-Bacon
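
Until crawling lands, one practical workaround for documentation sites is that many of them publish a sitemap.xml listing every page, so the URL list can be pulled from there and each entry submitted through the existing single-URL input. A hedged sketch, assuming the site exposes a standard sitemap (gradio.app is just the example from the comment above):

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_url):
    """Return every <loc> entry from a standard sitemap.xml."""
    with urlopen(sitemap_url, timeout=10) as resp:
        tree = ET.parse(resp)
    return [loc.text for loc in tree.iter(f"{SITEMAP_NS}loc")]


for url in sitemap_urls("https://www.gradio.app/sitemap.xml"):
    if "/docs/" in url:  # keep only the documentation pages
        print(url)  # feed each into the single-URL embed input
```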

Just to add a little bit more to this: if/when a web page can be scraped to a certain depth, that scrape may later need to be refreshed, as in the use case of documentation that gets updated over time. Brute force could work for a while (delete the old scrape, scrape again), but that ain't elegant like your app.

vrijsinghani avatar Mar 12 '24 07:03 vrijsinghani
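
One lightweight way to make that refresh less brute-force is to store a content hash per scraped page and only re-embed the pages whose hash changed on the next scrape. A minimal sketch of the idea (the names are illustrative, not the app's internals):

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Stable fingerprint of a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def pages_to_reindex(old_hashes: dict, fresh_pages: dict) -> list:
    """Compare a fresh scrape against stored hashes.

    old_hashes: url -> fingerprint from the previous scrape
    fresh_pages: url -> raw HTML from the new scrape
    Returns the URLs whose content is new or has changed.
    """
    return [
        url
        for url, html in fresh_pages.items()
        if old_hashes.get(url) != content_fingerprint(html)
    ]
```

Pages deleted from the site fall out naturally by diffing the two key sets.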

@vrijsinghani this is what we are calling "live" documents. It's a different feature, but something we will eventually support on the roadmap so we can "re-index" documents. We have done this exact thing in another product, so it's a matter of implementation.

timothycarambat avatar Mar 12 '24 16:03 timothycarambat