Issue: Amazon Bedrock Jurassic model responses getting truncated
Issue you'd like to raise.
When I create a Bedrock LLM client for the Jurassic models ("ai21.j2-mid", "ai21.j2-ultra"), the llm.invoke(query) method does not return the full result. The output appears to be truncated after the first line. It seems the LLM engine is streaming its output but the LangChain llm.invoke method is not able to handle the streamed data. A similar issue occurs with the chain.invoke(query) method as well.
However, llm.invoke works well with the AWS Bedrock "amazon.titan-tg1-large" model.
Suggestion:
I guess the AI21 Labs models stream output in a way that the LangChain llm.invoke(query) method is not able to handle properly.
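For context, a minimal reproduction sketch of the setup described above (the prompt, region, and client construction are assumptions; only the model_id values come from this report):

import boto3
from langchain.llms import Bedrock

# Assumed setup: AWS credentials and Bedrock model access are already configured.
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")
llm = Bedrock(model_id="ai21.j2-ultra", client=bedrock_client)

# Expected: a multi-paragraph answer. Observed: output truncated after the first line.
print(llm.invoke("Write three short paragraphs about the water cycle."))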
🤖
Hello,
Thank you for bringing this issue to our attention. Based on your description, it seems like the llm.invoke(query) method is not handling the streaming output from the AI21 Labs models correctly.
The llm.invoke(query) method in LangChain is designed to handle the output from different language models by converting the input into a suitable format, generating a prompt from the input, running the language model on the prompt, and returning the generated text. However, this method does not directly handle streaming data from language models. Instead, the stream and astream methods are responsible for handling streaming data. These methods yield chunks of text as they are generated by the language model, rather than waiting for the entire generation to complete.
If the AI21 Labs models are streaming their output, it's possible that the llm.invoke(query) method is not able to handle this correctly, as it's designed to wait for the entire generation to complete before returning the result. This could explain why the result is getting truncated after the first line.
As a workaround, you could try using the stream or astream methods instead of llm.invoke(query) when working with the AI21 Labs models. These methods are designed to handle streaming data and should be able to handle the output from these models correctly.
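For illustration, a minimal sketch of that workaround (assuming llm is the Bedrock client constructed in the report above; note that later comments in this thread point out that Jurassic models may not support streaming on Bedrock at all):

# Hedged sketch: llm is assumed to be the Bedrock LLM already built for "ai21.j2-ultra".
# If the underlying model does not support streaming, this may yield a single chunk or raise.
for chunk in llm.stream("Write three short paragraphs about the water cycle."):
    print(chunk, end="", flush=True)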
We will investigate this issue further and consider updating the llm.invoke(query) method to handle streaming data more effectively. We appreciate your patience and understanding as we work to resolve this issue.
Best, Dosu
Sources
- libs/langchain/langchain/llms/base.py
- libs/langchain/langchain/llms/mosaicml.py
- libs/langchain/langchain/callbacks/streaming_aiter.py
@sudipkumarbhattacharya I am not entirely sure whether the Jurassic models have streaming support, but on Bedrock it is not yet supported. Streaming is supported for the Claude and Titan models though.
bedrock.invoke_model may not return the full completion, and that's what you get when you call llm.invoke. Check the PR above that aims to fix that.
For streaming responses the correct API is bedrock.invoke_model_with_response_stream, not invoke_model.
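To make the distinction concrete, a rough boto3 sketch (model ids, prompts, and region are illustrative; the Jurassic body uses prompt/maxTokens and the Claude body uses max_tokens_to_sample, per the Bedrock provider docs):

import json
import boto3

# Assumed: credentials and model access are configured; region is illustrative.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Non-streaming: the full JSON payload comes back in resp["body"].
resp = client.invoke_model(
    modelId="ai21.j2-mid",
    body=json.dumps({"prompt": "Explain the water cycle.", "maxTokens": 512}),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(resp["body"].read()))

# Streaming: an event stream of chunks (only for models that support streaming,
# e.g. Claude and Titan at the time of this thread).
stream_resp = client.invoke_model_with_response_stream(
    modelId="anthropic.claude-instant-v1",
    body=json.dumps({
        "prompt": "\n\nHuman: Explain the water cycle.\n\nAssistant:",
        "max_tokens_to_sample": 512,
    }),
)
for event in stream_resp["body"]:
    print(json.loads(event["chunk"]["bytes"]))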
@sansmoraxz I had tried the AWS sample code with bedrock.invoke_model.
While bedrock.invoke_model returned multi-line responses from the Jurassic models correctly, llm.invoke was returning only the first line. For the Titan model both were working fine.
I assumed that the truncation in llm.invoke might be related to some non-standard streaming style from the Jurassic models; the root cause may be something different altogether.
Thank you for picking this up and providing a fix. I will test it out again.
Hey @sudipkumarbhattacharya, think LiteLLM could help here. We did the work to unify the streaming outputs across bedrock, openai, huggingface, etc.
import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""
os.environ["AWS_REGION_NAME"] = ""

## call anthropic
response = completion(
    model="bedrock/anthropic.claude-instant-v1",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
    stream=True,
)
for chunk in response:
    print(chunk)

## call j2
response = completion(
    model="bedrock/ai21.j2-mid",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
    stream=True,
)
for chunk in response:
    print(chunk)
Langchain Integration: https://python.langchain.com/docs/integrations/chat/litellm LiteLLM Docs: https://docs.litellm.ai/docs/providers/bedrock
Thank you Krish, will check this out.
Regards, Sudip
🤖
@baskaryan, I could use your expertise on this one.
AFAIK Jurassic models still don't support response streaming. At least not in the playground.
Could anyone share some sample prompts to test it out? My use cases don't result in truncation.
If required perhaps #9410 can be reopened.
Hey @sansmoraxz, we handle this by faking the stream output to ensure this doesn't break the stream format - https://github.com/BerriAI/litellm/blob/1aa53f632236185a3c3f532f7950132a7d02968d/litellm/utils.py#L3675
If anyone is looking into this because their Anthropic model output is also being truncated: this comes from LLMInputOutputAdapter.prepare_input(), defined in the langchain_community Bedrock support. It artificially adds its own token limit. To bypass it, construct the chat model like this:
# BedrockChat lives in the langchain_community package mentioned above.
from langchain_community.chat_models import BedrockChat

chat = BedrockChat(
    model_id="anthropic.claude-v2",
    model_kwargs={"max_tokens_to_sample": 100000},
)
This prevents truncation. @sansmoraxz, is it possible to change the key name from max_tokens_to_sample to max_tokens? It took me an extremely long time to find this non-intuitive key name. It is not present even in the boto documentation or the LangChain documentation, and LangChain chat gives wrong answers about it.
@Pramodh-G the APIs are documented on the Bedrock providers page; check the request body.
Be careful with the timeouts though. Larger contents take longer to tokenize.
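For reference, a rough sketch of the Claude text-completion request body on Bedrock, which is where the max_tokens_to_sample name comes from (the exact field list is an assumption; verify it against the provider page):

# Illustrative body for anthropic.claude-v2 on Bedrock; max_tokens_to_sample is the
# provider's parameter name, which LangChain passes through via model_kwargs.
body = {
    "prompt": "\n\nHuman: Summarize the water cycle.\n\nAssistant:",
    "max_tokens_to_sample": 100000,
    "temperature": 0.5,
    "stop_sequences": ["\n\nHuman:"],
}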
After performing some investigation, I can say that Bedrock will try to respect the token count you are requesting.
What @Pramodh-G observed is that the default token count is too low for some use cases. Having a large token count should help bypass the edge case, but this may in turn result in API calls taking longer.
Testing with the raw boto3 client, I observed that if the output has not been completed, it will show the stopReason as LENGTH (both streaming and non-streaming). Recursively calling the API until we get a terminal stop reason might be another approach, as sketched below. That approach would require some radical restructuring if we tried to pull it off with LangChain, though.
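To make that idea concrete, a rough sketch of re-invoking the model until a terminal stop reason comes back (the response field names follow the AI21 Jurassic response shape as I understand it and should be treated as assumptions):

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical helper: keep calling invoke_model, appending output, until the finish
# reason is no longer "length" (i.e. the model was not cut off by the token limit).
def complete_until_done(prompt: str, model_id: str = "ai21.j2-mid", max_rounds: int = 5) -> str:
    text = ""
    for _ in range(max_rounds):
        resp = client.invoke_model(
            modelId=model_id,
            body=json.dumps({"prompt": prompt + text, "maxTokens": 512}),
        )
        payload = json.loads(resp["body"].read())
        completion = payload["completions"][0]
        text += completion["data"]["text"]
        # Assumed field: finishReason.reason == "length" means the output was truncated.
        if completion["finishReason"]["reason"] != "length":
            break
    return text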
Perhaps one of the maintainers can pitch in.