[Feature Request]: Allow user to choose self.CHUNK_SIZE (or choose it automatically)
Is your feature request related to a problem? Please describe.
`self.CHUNK_SIZE = 5` is hard-coded. Such small chunks can be expensive, because most of each prompt consists of the task description and seed examples, which are repeated for every chunk.
Describe the solution you'd like
First, `chunk_size` should be an optional parameter of `LabelingAgent` in its `__init__`, with a default value of 5. That would be the simplest, most immediate solution.
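As a rough sketch (the real `LabelingAgent` constructor in autolabel takes other arguments; `config` is just a placeholder here):

```python
# Hypothetical sketch of the proposed parameter; the actual
# LabelingAgent signature in autolabel may differ.
class LabelingAgent:
    def __init__(self, config=None, chunk_size: int = 5):
        self.config = config
        # Previously hard-coded as self.CHUNK_SIZE = 5
        self.CHUNK_SIZE = chunk_size
```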
It would be great if it also supported `chunk_size="auto"`. Here's how it would work:
- At least for OpenAI models, tiktoken can count the tokens in a string for a particular model, and each model's maximum context length is known, e.g. 8192 for GPT-4.
- For each chunk, keep adding examples to the prompt until the token limit would be exceeded. Estimate this by (1) counting the tokens already in the prompt, and (2) estimating the longest possible reply from the LLM, assuming each example receives the longest possible label.
- If the result comes back incomplete, display a warning message asking the user to file an issue, since that indicates the token estimate was too aggressive.
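The greedy packing described above could look like the sketch below. `count_tokens` is a stand-in for `tiktoken.encoding_for_model(model).encode(...)` so the example stays self-contained, and `MAX_TOKENS`, `max_label_tokens`, and `build_chunks` are hypothetical names, not existing autolabel API:

```python
# Assumed per-model context limits (hypothetical table).
MAX_TOKENS = {"gpt-4": 8192}

def count_tokens(text: str) -> int:
    # Placeholder: real code would use tiktoken for the target model.
    return len(text.split())

def build_chunks(examples, prompt_prefix, model="gpt-4",
                 max_label_tokens=10, aggressiveness=1.0):
    """Greedily pack examples into chunks that fit the token budget."""
    limit = int(MAX_TOKENS[model] * aggressiveness)
    base = count_tokens(prompt_prefix)  # description + seed examples
    chunks, current, used = [], [], base
    for ex in examples:
        # Input tokens plus worst-case reply length for this example.
        cost = count_tokens(ex) + max_label_tokens
        if current and used + cost > limit:
            chunks.append(current)
            current, used = [], base
        current.append(ex)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```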
Additional context
- Users could experiment with `chunk_size` to find, for their task, the highest chunk size that still yields a high acceptance rate.
- There could be an aggressiveness parameter for auto-chunk creation, defaulting to 1.0. Lowering it would treat the effective token limit as MAX_TOKENS[model] * aggressiveness.
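The assumed semantics of that knob, as a one-liner (`effective_limit` and the `MAX_TOKENS` table are illustrative names, not existing API):

```python
# Hypothetical per-model context limits.
MAX_TOKENS = {"gpt-4": 8192}

def effective_limit(model: str, aggressiveness: float = 1.0) -> int:
    # Pretend the model's window is smaller than it really is, leaving
    # headroom for error in the token-count estimate.
    return int(MAX_TOKENS[model] * aggressiveness)
```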