Web connector refactoring

Open emerzon opened this issue 1 year ago • 1 comment

Description

  • Improve ScrapeSessionContext management and cleanup: Enhanced resource management within ScrapeSessionContext for better stability.
  • Cache protected_url_check results using lru_cache: Added caching to DNS lookups for performance.
  • Optimize Playwright startup and block unnecessary resources: Sped up Playwright by blocking non-essential resources.
  • Use the lxml parser for better performance.
  • Improve cookie handling logic: Made cookie handling session-aware and more efficient.
  • Optimize PDF handling with HEAD request and streaming: Improved PDF detection and download efficiency.
  • Improve scrolling, content type, and link handling in scrape: Refined page scrolling, added HTML content-type checks, and optimized internal link processing.
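
As an illustration of the `lru_cache` item above, here is a minimal sketch of caching DNS lookups behind a protected-URL check. It is not the code from this PR: the names `resolve_host_ips` and `is_protected_url` are hypothetical, and treating any non-global address as "protected" is an assumption about what `protected_url_check` guards against.

```python
import ipaddress
import socket
from functools import lru_cache
from urllib.parse import urlparse


@lru_cache(maxsize=1024)
def resolve_host_ips(hostname: str) -> tuple[str, ...]:
    """Resolve a hostname to its IP addresses; repeated calls are served from the cache."""
    infos = socket.getaddrinfo(hostname, None)
    # info[4][0] is the address string; drop any IPv6 scope id such as "%eth0".
    return tuple(sorted({info[4][0].split("%")[0] for info in infos}))


def is_protected_url(url: str) -> bool:
    """Return True if the URL's host resolves to any non-global address (assumed policy)."""
    hostname = urlparse(url).hostname
    if not hostname:
        raise ValueError(f"URL has no hostname: {url!r}")
    return any(
        not ipaddress.ip_address(ip).is_global
        for ip in resolve_host_ips(hostname)
    )


if __name__ == "__main__":
    # Two URLs on the same host trigger only one lookup.
    print(is_protected_url("http://127.0.0.1/admin"))
    print(is_protected_url("http://127.0.0.1/other"))
    print(resolve_host_ips.cache_info())
```

Caching by hostname rather than by full URL means every page on the same site shares one lookup. The trade-off is that a cached result can go stale if the DNS record changes during a long crawl, so a bounded `maxsize` or a periodic `cache_clear()` may be worth considering.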

emerzon, Apr 21 '25 17:04

Someone is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot], Apr 21 '25 17:04

This PR is stale because it has been open 75 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.

github-actions[bot], Jul 07 '25 11:07

Closing as deprecated

emerzon, Jul 13 '25 04:07