
Bypassing the context length / max-token limit of LLMs using DSPy for in-context self-instruct

Open sreenivasmrpivot opened this issue 2 years ago • 16 comments

I have tried using Llama 2 to generate synthetic data for self-instruct. Unfortunately, my prompts are long, and the prompt/response combination from the Llama 2 13B chat model constantly exceeds the 4096-token limit.

Is there any way to bypass this limitation using DSPy with the Llama 2 model? Should I be using the chat model or some other model to be able to do in-context self-instruct with DSPy and Llama 2?

Are there any examples in DSPy documentation which I can refer to?

sreenivasmrpivot avatar Sep 01 '23 13:09 sreenivasmrpivot

I switched to using gpt-3.5-turbo-16k to get around this problem, but it's a paid/closed model. Perhaps someone here can suggest an equivalent open-source/free model.

drawal1 avatar Sep 01 '23 15:09 drawal1

I guess the Giraffe model has a longer context and can get around it. So if I understand correctly, DSPy cannot help with this problem; the only way is to choose a model with a longer context.

sreenivasmrpivot avatar Sep 01 '23 16:09 sreenivasmrpivot

Using a long-context model is the easiest thing.

But DSPy is a general framework. You can implement at least 5-6 different ways to deal with long context in your own logic. Think chunking with a map/reduce style, etc. (rough sketch below).

If you can provide more details, I can suggest 1-2 approaches.
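For instance, a minimal map/reduce sketch along these lines; the signature names, field names, and the 2000-character chunk size are illustrative placeholders, not a DSPy built-in:

```python
import dspy

class Summarize(dspy.Signature):
    """Summarize the passage, keeping only facts relevant to the question."""
    question = dspy.InputField()
    passage = dspy.InputField()
    summary = dspy.OutputField()

class Answer(dspy.Signature):
    """Answer the question from the combined summaries."""
    question = dspy.InputField()
    context = dspy.InputField()
    answer = dspy.OutputField()

class MapReduceQA(dspy.Module):
    def __init__(self, chunk_chars=2000):
        super().__init__()
        self.chunk_chars = chunk_chars
        self.map_step = dspy.Predict(Summarize)
        self.reduce_step = dspy.ChainOfThought(Answer)

    def forward(self, question, document):
        # Map: summarize each chunk independently, so no single call exceeds the context window.
        chunks = [document[i:i + self.chunk_chars]
                  for i in range(0, len(document), self.chunk_chars)]
        summaries = [self.map_step(question=question, passage=c).summary for c in chunks]
        # Reduce: answer from the concatenated (much shorter) summaries.
        return self.reduce_step(question=question, context="\n".join(summaries))
```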

okhat avatar Sep 01 '23 17:09 okhat

@okhat I am trying to implement the Gorilla model, which uses APIs from HuggingFace, TensorFlow Hub, and PyTorch Hub. My goal is to generate synthetic data using a fully open-source model and avoid using GPT-4 for commercial-viability reasons. So I want to use Llama 2, provide in-context self-instruct prompts, and get some output. However, when I try to do that directly with text prompting, I exceed the 4096 tokens allowed by Llama and end up with this error:

Exception has occurred: APIError Invalid response object from API: '{"detail":{"object":"error","message":"This model\'s maximum context length is 4096 tokens. However, you requested 6276 tokens (2180 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.","type":"invalid_request_error","param":null,"code":null}}' (HTTP response code was 400)

I am using vLLM. I believe you work with Rick Battle to some extent; I am trying to get this implemented and contribute it to Rick's team.

Any suggestions are much appreciated.

sreenivasmrpivot avatar Sep 01 '23 20:09 sreenivasmrpivot

Thanks @sreenivasmrpivot. Yes we collaborate with Rick very frequently!

> However, you requested 6276 tokens (2180 in the messages, 4096 in the completion)

From this error, it seems your input isn't actually that long. The prompt is just 2180 tokens. Do you need 4096 output tokens?

Maybe just set the output to 256 tokens? Or 512?

okhat avatar Sep 05 '23 13:09 okhat

@okhat I have attached my actual input prompt here. Do you still think I can get around the problem by controlling the output to 256 or 512? If yes, where can I set the output length in the code?

sample1.txt

The output from the model is expected to contain 10 API-Inst pair examples, which is pretty long.

If I use Llama 2 13B, which has a maximum of 4096 tokens, is there any way to get this expected output using the combination of DSPy and Llama 2 13B?

If it is not possible, I am considering using https://huggingface.co/NousResearch/Yarn-Llama-2-13b-128k instead of Llama 2 13B.

sreenivasmrpivot avatar Sep 06 '23 23:09 sreenivasmrpivot

@okhat do you have any suggestions or updates for this ^^^?

sreenivasmrpivot avatar Sep 08 '23 17:09 sreenivasmrpivot

@sreenivasmrpivot you can increase max_tokens as follows: `llm = dspy.OpenAI(model='gpt-3.5-turbo-16k', max_tokens=8000)`

Off the top of my head, could you generate one API-Inst pair at a time and pass in the instructions of the previously generated API-Inst pairs, asking the model not to generate an API-Inst pair similar to the ones already generated?
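Something roughly like this; the signature, field names, and the `api_doc` variable are placeholders for your actual prompt, and an LM must already be configured via `dspy.settings.configure`:

```python
import dspy

# Placeholder input: your API description text (e.g., one HuggingFace API doc).
api_doc = "..."

class GeneratePair(dspy.Signature):
    """Given an API description and previously generated instructions,
    produce one new API-Inst pair that is dissimilar to the previous ones."""
    api_description = dspy.InputField()
    previous_instructions = dspy.InputField(desc="instructions already generated; avoid similar ones")
    instruction = dspy.OutputField()
    api_call = dspy.OutputField()

generate_pair = dspy.Predict(GeneratePair)

pairs, previous = [], []
for _ in range(10):
    # Each call only carries the API doc plus prior instructions, keeping prompts short.
    pred = generate_pair(api_description=api_doc,
                         previous_instructions="\n".join(previous) or "None yet")
    pairs.append((pred.instruction, pred.api_call))
    previous.append(pred.instruction)
```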

drawal1 avatar Sep 08 '23 18:09 drawal1

@drawal1 I like the suggestions regarding max_tokens and generating one pair at a time, though I am not sure whether the generation would avoid repetitions unless I try it.

However, since gpt-3.5-turbo-16k has a 16k context length, it might work. Would the above approach work for Llama 2, which has only a 4k context length?

sreenivasmrpivot avatar Sep 08 '23 19:09 sreenivasmrpivot

4k tokens is roughly 3000 words, so Llama 2's 4k context might work. You won't know until you try.

drawal1 avatar Sep 08 '23 19:09 drawal1

max_tokens refers to the maximum output tokens, @sreenivasmrpivot

Setting it to 4000 for Llama only makes sense if your input is empty, which it isn't.

Just set it to 512, or consider restructuring the output to be generated one pair at a time, as @drawal1 suggests.
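For example, something along these lines if you're serving Llama 2 through vLLM; the client name and arguments here are assumptions and may differ by DSPy version, but max_tokens is the knob that matters:

```python
import dspy

# Assumed setup: cap the completion at 512 tokens so prompt + completion stays under 4096.
# The vLLM client name/arguments may differ across DSPy versions; adjust to your install.
llama = dspy.HFClientVLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    port=8000,
    url="http://localhost",
    max_tokens=512,
)

dspy.settings.configure(lm=llama)
```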

okhat avatar Sep 08 '23 19:09 okhat

Is this resolved?

okhat avatar Sep 14 '23 12:09 okhat

I'm also having an issue with this: if I compile a Module with a teleprompter and then try to run it forward, it often creates prompts that are too long. Is there a way to avoid this?

ahoho avatar Dec 18 '23 21:12 ahoho

Hey @ahoho yes happy to help. I may need more details but basically:

You can reduce the parameters of the teleprompter (max_bootstrapped_demos and max_labeled_demos) for a start. They default to 4 and 16, respectively. Maybe do 1 and 0 to be extreme.
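For example, roughly like this; my_metric, my_module, and trainset are placeholders for your own metric, program, and training data:

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Keep the compiled prompt short by limiting how many demos get packed into it.
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=my_metric,            # your evaluation metric
    max_bootstrapped_demos=1,    # default is 4
    max_labeled_demos=0,         # default is 16
)
compiled_program = teleprompter.compile(my_module, trainset=trainset)
```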

okhat avatar Dec 20 '23 04:12 okhat

Yes, I think this is the issue: the demonstrations end up creating a prompt that's too long. I think it's because I'm mirroring a RAG setting for classification, and the context is repeated for each of the bootstrapped demos.

ahoho avatar Dec 20 '23 13:12 ahoho

@ahoho Oh wow I just saw this by accident, not sure why I missed it earlier.

Did my suggestion resolve it? Setting max_bootstrapped_demos=1 and max_labeled_demos=0, assuming you're doing BootstrapFewShotWithRandomSearch.

okhat avatar Dec 25 '23 02:12 okhat

Sorry, I also missed your response! Yes, that did resolve the problem.

ahoho avatar Feb 05 '24 21:02 ahoho