llm-scraper Hitting input token limit on local language models


When scraping fairly large websites, we hit the token limit and receive the GGML_ASSERT error:

 n_tokens_all <= cparams.n_batch

For smaller websites this isn't an issue.

We should consider splitting the page into chunks once it exceeds a certain length threshold, summarising each chunk with the local language model, and then stitching the summaries together coherently with one more model call.
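A rough sketch of that chunk, summarise, and combine idea, assuming a caller-supplied `summarize` function that wraps whatever local model is in use. The names `Summarize`, `chunkText`, `summarizeLongText` and the character-based threshold are all hypothetical and not part of llm-scraper; a real version would probably count tokens rather than characters.

```typescript
// Hypothetical type: any function that sends a prompt to a model and returns its text reply.
type Summarize = (prompt: string) => Promise<string>;

// Split text into chunks of at most maxChars characters, preferring to break on spaces.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Back up to the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    const piece = text.slice(start, end).trim();
    if (piece) chunks.push(piece);
    start = end;
  }
  return chunks;
}

// Summarise each chunk, then ask the model once more to merge the partial summaries.
async function summarizeLongText(
  text: string,
  summarize: Summarize,
  maxChars = 8000 // assumed threshold; tune to the model's batch/context size
): Promise<string> {
  const chunks = chunkText(text, maxChars);
  if (chunks.length <= 1) return text; // short page: pass through unchanged
  const partials: string[] = [];
  for (const chunk of chunks) {
    partials.push(await summarize(`Summarise the following page fragment:\n\n${chunk}`));
  }
  return summarize(
    `Combine these partial summaries into one coherent summary:\n\n${partials.join("\n\n")}`
  );
}
```

The chunks are processed sequentially because a single local model instance usually cannot serve parallel requests. The trade-off is that summarising is lossy, so fields the schema needs may be dropped before extraction.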

Another thought I've had is to take screenshots with Playwright instead and run text recognition on them. Or, perhaps even better, use a Playwright method that extracts only the text content and leaves out the HTML entirely, if one exists.
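For the text-only route, Playwright's `page.innerText(selector)` returns the rendered text of an element, so something along these lines might work. This is only a sketch: `PageLike` and `extractVisibleText` are hypothetical names, and the interface is a stand-in covering just the two `Page` methods used, so the snippet does not depend on Playwright being installed.

```typescript
// Minimal structural type with the two Playwright Page methods this sketch relies on.
interface PageLike {
  goto(url: string): Promise<unknown>;
  innerText(selector: string): Promise<string>;
}

// Load the page and return the rendered text of <body>, without any markup.
async function extractVisibleText(page: PageLike, url: string): Promise<string> {
  await page.goto(url);
  const raw = await page.innerText("body");
  // Trim each line and drop the empty ones to cut the token count a little further.
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");
}
```

A real Playwright `Page` (for example from `chromium.launch()` followed by `browser.newPage()`) should satisfy `PageLike` as-is. Dropping the markup also drops attributes such as link targets, which may matter for some schemas.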

May 03 '24 14:05 Ademsk1

The example https://news.ycombinator.com actually runs into this. I get the error GGML_ASSERT: D:\a\node-llama-cpp\node-llama-cpp\llama\llama.cpp\llama.cpp:11163: n_tokens_all <= cparams.n_batch

May 04 '24 21:05 DraconPern

We could try using Playwright's accessibility feature (https://playwright.dev/docs/accessibility-testing), which would extract all the text. It could be a good start for reducing the HTML size. @mishushakov

May 06 '24 14:05 Ademsk1

I'm also getting this with GPT-4-Turbo on some web pages. It only seems to hit the context length with mode: "html", but I find that mode: "text" isn't as accurate.

May 11 '24 11:05 siquick

I use gpt-4-turbo and hit the 4096 output token limit. The returned data gets truncated, resulting in an incomplete list ending with "...". Is there a way to keep requesting the API until the complete information is returned?

Error message: Bad control character in string literal in JSON at position 1624
    at safeParseJSON (file:///D:/Code/test/node_modules/@ai-sdk/provider-utils/dist/index.mjs:252:63)
    at generateObject (file:///D:/Code/test/node_modules/ai/dist/index.mjs:680:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async generateAISDKCompletions (file:///D:/Code/test/node_modules/llm-scraper/dist/models.js:22:20)
    at async file:///D:/Code/test/main.js:48:18 {
  cause: [SyntaxError: Bad control character in string literal in JSON at position 1624],
  text: '{"top":[{"name":"Eric Alm","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/eric-alm","fields":["Biophysics","Computational Modeling","Energy","Macromolecular Biochemistry","Microbial Systems","Omics","Synthetic Biology","Systems Biology"]},{"name":"Mark Bathe","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/mark-bathe","fields":["Biological Imaging","Biomolecular Engineering","Biophysics","Computational Modeling","Drug Delivery","Energy","Nanoscale Engineering","Neurobiological"]},{"name":"Angela Belcher","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/angela-belcher","fields":["Biomaterials","Biomolecular Engineering","Energy","Nanoscale Engineering","Synthetic Biology"]},{"name":"Prerna Bhargava","rank":"Research/Teaching Staff","email":"NA","url":"https://be.mit.edu/directory/prerna-bhargava","fields":["NA"]},{"name":"Michael Birnbaum","rank":"Associate Professor","email":"NA","url":"https://be.mit.edu/directory/michael-birnbaum","fields":["Biomolecular Engineering","Biophysics","Infectious Disease","Macromolecular Biochemistry"]},{"name":"Paul Blainey","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/paul-blainey","fields":["Biological Imaging","Biophysics","Drug Delivery","Infectious Disease","Macromolecular Biochemistry","Microbial Pathogenesis","Microbial Systems","Nanoscale Engineering","Omics"]},{"name":"Ed Boyden","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/ed-boyden","fields":["Biological Imaging","Biomolecular Engineering","Computational Modeling","Drug Delivery","Nanoscale Enginee...\n' +
    '"]}'
}

Jul 11 '24 15:07 beiyanpiki

Hey @beiyanpiki this is not an issue with llm-scraper, but with Vercel AI SDK. You can report the issue here: https://github.com/vercel/ai/issues

Jul 13 '24 10:07 mishushakov
