[BUG] colang content causing streaming issues
Hello, I have the sample code below that I am trying to run. It's based on the streaming docs. However, I'm hitting issues with streaming the output of LLMRails when using colang content. Do I have something set up incorrectly in my colang_content?
```
pip install --upgrade --force-reinstall nemoguardrails openai==0.27.8
```
```python
import asyncio
import logging

from langchain_community.chat_models import ChatOpenAI
from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.streaming import StreamingHandler

logging.basicConfig(level=logging.INFO)

openai_api_key = 'sk-...'  # YOUR_API_TOKEN

llm = ChatOpenAI(model_name='gpt-3.5-turbo-16k',
                 openai_api_key=openai_api_key,
                 temperature=0,
                 streaming=True)

YAML_CONFIG = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-16k
    streaming: True
"""

colang_content = """
define user ask weather
  "how is the weather today?"
  "should I wear a coat?"

define bot answer weather service down
  "Unfortunately, I'm unable to access my weather service API at the moment. This means I can't provide you with the real-time weather information such as weather conditions or forecasts for your area"

define flow weather
  user ask weather
  bot answer weather service down

define flow
  user ...
  bot greeting
"""


async def demo_1():
    """Demo using the streaming of response chunks directly."""
    config = RailsConfig.from_content(yaml_content=YAML_CONFIG, colang_content=colang_content)
    # config = RailsConfig.from_content(yaml_content=YAML_CONFIG)  # allows streaming but no colang
    app = LLMRails(config, llm=llm, verbose=False)

    history = [{"role": "user", "content": "tell me a story about Unicorns?"}]
    async for chunk in app.stream_async(messages=history):
        print(chunk, end="")


asyncio.run(demo_1())
```
I get the entire result at the end rather than each chunk being printed as it is processed. Is that the expected output? Could it be that I'm using a different version of the Colang language than LLMRails is expecting? Thank you for your time!
It seems to be an issue with

```
define user ask weather
  "how is the weather today?"
  "should I wear a coat?"
```

When that chunk is removed from colang_content, streaming works. I wonder why, though. Proceeding with the removal would also take away the ability to match semantically similar user questions about the current weather.
@drazvan I wanted to bump this and see if there's anything I can do to circumvent this unexpected result on the latest version of guardrails.
Thank you!
Hi @shiv248!
The issue is due to the LLM not following the exact format when generating the response. Instead of generating `  "<message>"` (with two leading spaces before the opening quote), it generates `"<message>"`, i.e. it is missing the two leading spaces. If you want to test with a clone of the nemoguardrails repo, you can replace line 876 here:
https://github.com/NVIDIA/NeMo-Guardrails/blob/develop/nemoguardrails/actions/llm/generation.py#L876
with
```python
streaming_handler.set_pattern(prefix='"', suffix='"')
```
(removed the two leading spaces for the prefix).
We'll try to figure out a solution for this, and hopefully it will make it into 0.10.0.
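Concretely, the suggested change amounts to something like the following sketch (the "before" line is inferred from the description above and may not match the current generation.py exactly):

```python
# Before: the streaming handler expects the bot message to start with two spaces
# followed by a double quote, which the LLM does not always produce.
streaming_handler.set_pattern(prefix='  "', suffix='"')

# After: drop the two leading spaces so the pattern matches what the LLM actually emits.
streaming_handler.set_pattern(prefix='"', suffix='"')
```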
@drazvan wow, that's an interesting find. What if I set it to the other case in that if/else statement by putting `output_parser: verbose_v1` in my config? Could that potentially fix it too, since that would make it

```python
streaming_handler.set_pattern(prefix='Bot message: "', suffix='"')
```

which I can still work around post-generation?
Yes, that is potentially a good workaround. If you override the prompt for gpt-3.5-turbo-16k similar to https://github.com/NVIDIA/NeMo-Guardrails/blob/main/nemoguardrails/llm/prompts/dolly.yml, then it could work.
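For reference, a rough sketch of what such a prompt override might look like, extending the YAML_CONFIG from the example above and following the structure of the linked dolly.yml. The `generate_bot_message` task name, the `openai/gpt-3.5-turbo-16k` model key, and the placeholder template body are assumptions; the real template content should be copied unchanged from the default prompts shipped with nemoguardrails, with only the `output_parser` line added:

```python
# Hypothetical sketch only: a prompt override that switches bot message generation
# to the verbose_v1 output parser, which the discussion above suggests makes the
# streamed output start with 'Bot message: "'. Task name, model key, and the
# placeholder template are assumptions based on the linked dolly.yml.
YAML_CONFIG = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-16k
    streaming: True

prompts:
  - task: generate_bot_message
    models:
      - openai/gpt-3.5-turbo-16k
    output_parser: verbose_v1
    content: |-
      ... paste the default generate_bot_message template here, unchanged ...
"""
```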