langstream

Web crawler enhancements

Open cdbartholomew opened this issue 1 year ago • 0 comments

After using the web crawler on a few sites, I think the following enhancements would be useful:

  • A URL blacklist to exclude pages from crawling
  • Honoring robots.txt
  • A configurable interval for periodic re-crawls of the site
  • When re-crawling the site, check whether each page has been updated and skip it if not. This avoids recalculating vector embeddings for an unchanged page.
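
To make the requests above concrete, here is a minimal Python sketch of how the four checks could fit together. It is only an illustration, not LangStream's implementation: the `CrawlFilter` class, its method names, and the `example-crawler` user agent are all hypothetical, and change detection is done here by hashing the page body, which is just one possible approach.

```python
import hashlib
import re
from urllib.robotparser import RobotFileParser


class CrawlFilter:
    """Hypothetical helper; none of these names come from LangStream."""

    def __init__(self, blocked_patterns, robots_txt, user_agent="example-crawler"):
        # Regex patterns for URLs that must never be crawled (the blacklist).
        self.blocked = [re.compile(p) for p in blocked_patterns]
        self.user_agent = user_agent
        # Parse robots.txt text that the caller has already downloaded.
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        # Maps URL -> SHA-256 digest of the body seen on the last crawl.
        self.seen_hashes = {}

    def should_fetch(self, url):
        """Return True only if the URL passes the blacklist and robots.txt."""
        if any(p.search(url) for p in self.blocked):
            return False
        return self.robots.can_fetch(self.user_agent, url)

    def has_changed(self, url, body):
        """Return True if the body (bytes) differs from the last crawl."""
        digest = hashlib.sha256(body).hexdigest()
        if self.seen_hashes.get(url) == digest:
            return False  # unchanged: embeddings can be reused
        self.seen_hashes[url] = digest
        return True


def is_recrawl_due(last_crawl_ts, interval_seconds, now_ts):
    """Return True once the configured re-crawl interval has elapsed."""
    return now_ts - last_crawl_ts >= interval_seconds
```

A crawler could call `should_fetch` before requesting a page and `has_changed` after downloading it, recomputing embeddings only when the latter returns `True`. An alternative to hashing would be HTTP conditional requests (`ETag` with `If-None-Match`, or `Last-Modified` with `If-Modified-Since`), which avoid downloading unchanged pages at all when the server supports them.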

cdbartholomew · Aug 30 '23 17:08