onyx
Web connector refactoring
Description
- Improve ScrapeSessionContext management and cleanup: Enhanced resource management within ScrapeSessionContext for better stability.
- Cache protected_url_check results using lru_cache: Added caching to DNS lookups for performance.
- Optimize Playwright startup and block unnecessary resources: Sped up Playwright by blocking non-essential resources.
- Use the lxml parser for better performance.
- Improve cookie handling logic: Made cookie handling session-aware and more efficient.
- Optimize PDF handling with HEAD request and streaming: Improved PDF detection and download efficiency.
- Improve scrolling, content type, and link handling in scrape: Refined page scrolling, added HTML content-type checks, and optimized internal link processing.
This PR is stale because it has been open 75 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Closing as deprecated