Scrapegraph-ai
How could I remove part of the page content before sending it to the LLM?
Is your feature request related to a problem? Please describe. Many LLMs support only 32k tokens, and many web pages contain more than 32k tokens of content. When I send the page content to an LLM like qwen or deepseek, it always fails.
Describe the solution you'd like A way to clean the HTML before sending it to the LLM. If I could remove some parts of the page HTML, the size could be reduced so that it would not exceed the model's maximum token limit.
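As a workaround while no built-in option exists, the bulky sections could be pruned before the HTML ever reaches the graph. The sketch below uses requests and BeautifulSoup (my own choice of tools, not part of scrapegraph-ai) to drop header, footer, nav, script and style tags; whether the pruned HTML can then be passed directly as the graph's source instead of the URL is an assumption.

```python
# Hypothetical pre-cleaning step: fetch the page and strip bulky,
# low-value tags before the HTML is sent to the LLM.
import requests
from bs4 import BeautifulSoup

def clean_html(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Remove sections that rarely carry the content of interest.
    for tag in soup(["header", "footer", "nav", "aside", "script", "style"]):
        tag.decompose()

    return str(soup)

cleaned = clean_html("https://example.com/long-page")
# Assumption: the cleaned HTML string can then be fed to the scraping
# graph in place of the original URL.
```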
Describe alternatives you've considered I have tried different models and tried setting max_tokens.
Additional context I am scraping pages with a lot of HTML content, but the important content is far less than 32k tokens. The header and the bottom section of the page take up a large share of the tokens in the HTML. I hope I can remove them before sending the HTML to the LLM.
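To check whether the pruning is enough, a rough token count can be taken before anything is sent. The snippet below uses tiktoken's cl100k_base encoding purely as an approximation; qwen and deepseek use their own tokenizers, so the real counts will differ.

```python
# Rough token estimate (cl100k_base is an OpenAI encoding; qwen/deepseek
# counts will differ, but the order of magnitude is usually similar).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

if count_tokens(cleaned) > 32_000:
    print("Still over the 32k context limit; prune more tags or split the page.")
```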