dust Expose a tokenizer function in code blocks

The main idea is to split text into windows of tokens that fit into the context windows of LLMs. Example: take these answers, group them into chunks of 4000 tokens, summarize each chunk, then group and summarize the summaries recursively until you have a single chunk of at most 4000 tokens that can be used to answer the original question.
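To make the request concrete, here is a rough sketch of the loop described above. It is only an illustration: `chunk_by_tokens`, `recursive_summarize`, and the `summarize` callback are hypothetical names, not Dust functions, and the default tokenizer is a plain whitespace split standing in for a real one (for example a tiktoken encoding's `encode`/`decode` pair).

```python
from typing import Callable, List, Sequence


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): real LLM tokenizers count differently.
    return text.split()


def whitespace_detokenize(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


def chunk_by_tokens(
    texts: Sequence[str],
    max_tokens: int,
    tokenize: Callable[[str], list] = whitespace_tokenize,
    detokenize: Callable[[Sequence], str] = whitespace_detokenize,
) -> List[str]:
    """Concatenate all texts and cut them into windows of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = [tok for text in texts for tok in tokenize(text)]
    return [
        detokenize(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def recursive_summarize(
    texts: Sequence[str],
    summarize: Callable[[str], str],  # hypothetical: wraps the LLM call
    max_tokens: int = 4000,
    tokenize: Callable[[str], list] = whitespace_tokenize,
    detokenize: Callable[[Sequence], str] = whitespace_detokenize,
) -> str:
    """Summarize chunks, regroup the summaries, and repeat until one chunk is left."""
    chunks = chunk_by_tokens(texts, max_tokens, tokenize, detokenize)
    while len(chunks) > 1:
        summaries = [summarize(chunk) for chunk in chunks]
        regrouped = chunk_by_tokens(summaries, max_tokens, tokenize, detokenize)
        if len(regrouped) >= len(chunks):
            # Guard against looping forever when summaries do not get shorter.
            raise ValueError("summaries are not shrinking")
        chunks = regrouped
    return chunks[0] if chunks else ""
```

The final chunk returned by `recursive_summarize` is what would be passed to the model together with the original question. The guard on the number of chunks is a design choice of this sketch, so that a summarizer that does not compress fails loudly.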

May 10 '23 12:05 happysalada

Yes, this is definitely on our radar. It is likely that we will expose these functions as part of code blocks in the near future :+1: Will keep this issue open to track progress.

May 10 '23 12:05 spolu

In Dust.tt, you can use JSONL for your datasets. I use tokenizer functions to write my JSONL files with X tokens per line, and then Dust.tt does the rest. Regarding tokenizers, here are some library suggestions: gpt-tokenizer, tiktoken, and gpt-3-encoder. If you would like code examples for any of these, please leave a comment and I will provide more details.
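As a rough idea of that workflow, the sketch below splits a text into JSONL lines of at most X tokens each. The function name `to_jsonl_lines` and the `"text"` field are made up for the example (use whatever keys your dataset expects), and the whitespace split is only a placeholder for one of the tokenizer libraries listed above.

```python
import json
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    # Placeholder tokenizer (assumption): swap in a real tokenizer for accurate counts.
    return text.split()


def to_jsonl_lines(
    text: str,
    max_tokens: int,
    field: str = "text",  # hypothetical field name
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Return one JSON object per line, each holding at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    lines = []
    for start in range(0, len(tokens), max_tokens):
        window = " ".join(tokens[start : start + max_tokens])
        lines.append(json.dumps({field: window}))
    return lines


def write_jsonl(path: str, lines: List[str]) -> None:
    # One JSON object per line, as the JSONL format requires.
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Calling `write_jsonl("dataset.jsonl", to_jsonl_lines(document, 4000))` would then produce a file that can be uploaded as a dataset, with `document` and the file name being placeholders.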

I recommend the following article, which provides an in-depth explanation of how to achieve effective recursive summarization. Although it's slightly different from what you asked for, it's a good starting point. 'In summary, our results show that combining recursive task decomposition with learning from human feedback can be a practical approach to scalable oversight for difficult long-document NLP tasks.' (Recursively Summarizing Books with Human Feedback)

Jun 08 '23 13:06 cmirdesouza