langstream
Web crawler enhancements
Having used the web crawler to crawl some sites, the following enhancements would be useful:
- URL blacklist to exclude pages from crawling
- Honoring robots.txt
- Configurable interval for periodic re-crawls of the site
- When re-crawling the site, check whether each page has changed since the last crawl and skip it if it has not. This avoids recalculating vector embeddings for unchanged pages.
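
To make the requests above more concrete, here is a minimal sketch in Python using only the standard library. It is not LangStream code and does not reflect how the web crawler is implemented; every function name, the pattern syntax (shell-style wildcards), and the in-memory `seen_hashes` dictionary are assumptions made for illustration. Change detection is done by hashing the page body, which is one possible approach; HTTP `ETag` or `Last-Modified` headers would be another.

```python
import fnmatch
import hashlib
from urllib.robotparser import RobotFileParser


def is_blocked(url, blocked_patterns):
    """Return True if the URL matches any shell-style pattern in the blacklist."""
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(fnmatch.fnmatchcase(url, p) for p in blocked_patterns)


def build_robots_parser(robots_txt):
    """Parse robots.txt content that the crawler has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_crawl(url, user_agent, robots, blocked_patterns):
    """A URL is crawled only if it is not blacklisted and robots.txt allows it."""
    return not is_blocked(url, blocked_patterns) and robots.can_fetch(user_agent, url)


def is_recrawl_due(last_crawl_ts, interval_seconds, now_ts):
    """Return True once the configured re-crawl interval has elapsed."""
    return now_ts - last_crawl_ts >= interval_seconds


def has_changed(url, body, seen_hashes):
    """Compare a hash of the page body with the one stored from the last crawl.

    Returns True (and records the new hash) when the page is new or modified,
    so that only those pages are sent on for embedding.
    """
    digest = hashlib.sha256(body).hexdigest()
    if seen_hashes.get(url) == digest:
        return False
    seen_hashes[url] = digest
    return True
```

A crawler loop would call `should_crawl` before fetching a URL and `has_changed` after fetching it, passing the raw response bytes as `body`. In a real deployment the hashes would need to live in persistent storage rather than a dictionary, so that they survive restarts.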