onyx
Feature Request: Use trafilatura for HTML parsing
Replace the web_html_cleanup function in html_utils.py (or potentially the whole web connector) with trafilatura.
It handles "noisy" elements much better and can output Markdown, which keeps structured data such as tables meaningful and can greatly help the LLM understand the content.
It also has native features, such as text deduplication and target-language filtering, that could be used to further reduce noise or unwanted text.
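A rough sketch of what the replacement could look like, to make the request concrete. The helper names `extraction_options` and `html_to_markdown` are hypothetical and are not existing onyx functions. The sketch assumes a trafilatura release that supports `output_format="markdown"`, and that language filtering has its optional language-detection dependency installed.

```python
from typing import Any, Dict, Optional


def extraction_options(target_language: Optional[str] = None) -> Dict[str, Any]:
    """Keyword arguments passed to trafilatura.extract (hypothetical helper)."""
    return {
        "output_format": "markdown",  # assumes a release with Markdown output
        "include_tables": True,       # keep tables so their structure survives
        "include_comments": False,    # drop user comment sections as noise
        "deduplicate": True,          # built-in removal of repeated text segments
        "target_language": target_language,  # e.g. "en"; None disables the filter
    }


def html_to_markdown(
    html: str,
    url: Optional[str] = None,
    target_language: Optional[str] = None,
) -> str:
    """Possible stand-in for web_html_cleanup; returns "" if nothing is extracted."""
    import trafilatura  # third-party dependency, imported lazily

    text = trafilatura.extract(html, url=url, **extraction_options(target_language))
    return text or ""
```

Something like `html_to_markdown(page_html, url=page_url, target_language="en")` could then be called wherever the connector currently cleans HTML. Whether the return shape matches what web_html_cleanup callers expect would need checking against the existing code.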