Feature Request: Use trafilatura for HTML parsing

Open emerzon opened this issue 1 year ago • 0 comments

Replace the web_html_cleanup function in html_utils.py (or potentially the whole web connector) with trafilatura.

It handles "noisy" elements much better and can output Markdown, which keeps structures like tables meaningful and can greatly help the LLM understand the data.

It also has native features such as text deduplication and target-language filtering, which could be used to further reduce noise or unwanted text.
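A rough sketch of what the replacement could look like, to make the request concrete. The names `clean_html_with_trafilatura` and `EXTRACT_OPTIONS` are hypothetical and are not taken from the existing code; the only library call used is `trafilatura.extract`, and the option values are assumptions that would need tuning against real pages.

```python
from typing import Optional

# Hypothetical defaults for trafilatura.extract(); the values are assumptions.
EXTRACT_OPTIONS = {
    "output_format": "markdown",  # keep headings and tables as Markdown
    "include_tables": True,       # tables stay in the extracted output
    "include_comments": False,    # drop user comment sections
    "deduplicate": True,          # skip repeated text segments
}


def clean_html_with_trafilatura(
    html: str, target_language: Optional[str] = None
) -> str:
    """Return the main content of an HTML page as Markdown, or "" if none.

    Hypothetical helper sketching a possible stand-in for web_html_cleanup.
    """
    if not html or not html.strip():
        return ""

    # Imported lazily so the dependency stays optional
    # (third-party: pip install trafilatura).
    import trafilatura

    text = trafilatura.extract(
        html,
        target_language=target_language,  # e.g. "en"; None disables the filter
        **EXTRACT_OPTIONS,
    )
    # extract() returns None when no usable content is found.
    return text or ""
```

Usage would be along the lines of `clean_html_with_trafilatura(page_html, target_language="en")`. As far as I know, the language filter depends on trafilatura's optional language-detection dependency being installed, so that part may be a no-op in a minimal install.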

emerzon, Sep 16 '24 00:09