anything-llm
[FEAT] Website Scraping with depth level
I saw that it is possible to "embed" a website from a URL. However, this embedding is done only at depth 0, that is, it only scrapes the page given in the URL. This means that if I have a tree-structured site of depth 3, it is not possible to scrape and then embed all of its pages recursively. Is this a bug or is it intended? Are there any plans to develop this functionality? Or has something already been done? If it is possible, I might as well implement this feature myself and then merge it.
Depth is not implemented currently because of how we originally implemented web-scraping. With recent developments, it is now possible to implement depth in the scraper, but the UI doesn't have the space for it.
In the future, all non-document scrapers will be moved to Data Connectors, where you can control the params for scraping, like depth, more granularly. The workspace UI during scraping just doesn't have the space to add all of those controls, so design work needs to be done on how we will bridge the two functionalities.
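For illustration only, here is a minimal sketch of what a depth-limited, same-origin crawl could look like as a standalone TypeScript/Node script. It is not AnythingLLM's actual scraper; the `crawl` function, the breadth-first loop, and the regex-based link extraction are all assumptions made for the sketch.

```typescript
// Sketch only: a breadth-first, depth-limited crawl of same-origin pages.
// Uses the global fetch available in Node 18+ and a crude regex for links;
// a real scraper would use a proper HTML parser and respect robots.txt.
type Page = { url: string; html: string };

async function crawl(rootUrl: string, maxDepth: number): Promise<Page[]> {
  const origin = new URL(rootUrl).origin;
  const seen = new Set<string>([rootUrl]);
  const pages: Page[] = [];
  let frontier = [rootUrl];

  // Depth 0 scrapes only the given page; each iteration goes one level deeper.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const html = await (await fetch(url)).text();
      pages.push({ url, html });

      // Collect same-origin links to visit at the next depth level.
      for (const match of html.matchAll(/href="([^"#]+)"/g)) {
        try {
          const link = new URL(match[1], url);
          if (link.origin === origin && !seen.has(link.href)) {
            seen.add(link.href);
            next.push(link.href);
          }
        } catch {
          // Ignore hrefs that are not valid URLs.
        }
      }
    }
    frontier = next;
  }
  return pages;
}

// e.g. crawl("https://www.gradio.app/docs/interface", 3).then((p) => console.log(p.length));
```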
How do we even upload URLs (be it at depth 0)?
@bokey007
Thank you @timothycarambat
Can we not upload more than one URL / website?
@bokey007 you can use that input to submit each website one at a time. We don't have a bulk website scraper in the tool to scrape many websites at once.
Got it, Thanks
yes... we need this feature
Just to add to this, if I wanted some help with Gradio's documentation, for example, then going to the link on their site would send me to https://www.gradio.app/docs/interface. If I add that link then I get that single page. However, there are dozens of pages, each with their own URL, and they would have to be entered one at a time - in the words of the immortal meme - 'ain't nobody got time for that!'
I know that there are potential issues around HTML vs JS sites etc., but the inability to have it do some kind of crawl on a documentation site is unbelievably limiting. Whatever the solution is that you're thinking of, please (pretty please?) implement it soon - without it the app is of very limited use for my circumstances, and I imagine for many others like me. I will beta test it til kingdom come if you give me the chance.... nudge nudge ;-)
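In the meantime, a small standalone sketch like the one below (not part of AnythingLLM) can at least produce the full list of per-page URLs so they can be submitted one at a time. It assumes the site publishes a standard sitemap at /sitemap.xml; whether gradio.app actually does is not verified here.

```typescript
// Sketch only: list a docs site's pages from its sitemap.xml so the URLs
// can be pasted into the scraper one by one. Assumes a standard sitemap
// exists at /sitemap.xml, which is not guaranteed for every site.
async function listSitemapUrls(siteRoot: string): Promise<string[]> {
  const xml = await (await fetch(new URL("/sitemap.xml", siteRoot))).text();
  // Every <loc>...</loc> entry in a sitemap is a page URL.
  return [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);
}

// Print only the documentation pages, e.g. those under /docs/.
listSitemapUrls("https://www.gradio.app").then((urls) =>
  console.log(urls.filter((u) => u.includes("/docs/")).join("\n"))
);
```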
Just to add a little bit more to this, if/when a web page can be scraped to a certain depth, that scrape may need to be refreshed later, as in the use case of documentation that gets updated over time. Brute force could work for a while (delete the old scrape, scrape again), but that ain't elegant like your app.
@vrijsinghani this is what we are calling "live" documents. It's a different feature, but something we will eventually support on the roadmap so we can "re-index" documents. We have done this exact thing in another product, so it's a matter of implementation.
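As a rough illustration of what re-indexing could involve (an assumption about one possible approach, not how the planned "live documents" feature works), a stored content hash could be compared against a fresh fetch so that only changed pages get re-scraped and re-embedded:

```typescript
import { createHash } from "node:crypto";

// Sketch only: decide whether a previously scraped page needs re-embedding
// by comparing a stored hash of its content with a hash of a fresh fetch.
async function needsReindex(url: string, storedHash: string): Promise<boolean> {
  const html = await (await fetch(url)).text();
  const freshHash = createHash("sha256").update(html).digest("hex");
  return freshHash !== storedHash;
}
```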