
Llama-3 regex generation can get stuck in infinite generation beyond max_tokens and crash server (reproduction example)


Hey, I've just been trying to catch this bug for half a day...

I installed with `pip install git+https://github.com/sgl-project/sglang.git@51104cd#subdirectory=python`, which is the commit where 0.1.14 was mentioned.

Launched server like this:

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 42069 --host 0.0.0.0 --tp-size 1 --mem-fraction-static 0.85

When the script below is launched, the server gets stuck in an infinite generation loop, running well beyond the specified max_tokens=1024, and then crashes. In my app the crash was a CUDA device-side assertion error (same underlying problem); in the reproduction below the error is `RecursionError: maximum recursion depth exceeded while calling a Python object`. Server log: logfile.txt

import sglang as sgl
import asyncio
import time

@sgl.function
def demo(s):
    s += sgl.system("You are a text string generation. Your goal is to generate a response to the user's instruction.")
    s += sgl.user_begin() + """I instruct you to make 10000 random text strings. Format your response like this:
```yaml
- "string1"
- "string2"
```""" + sgl.user_end()
    # Constrained generation: the regex allows an unbounded number of
    # '- "..."' lines, terminated by a closing ``` fence.
    s += sgl.assistant_begin() + "```yaml\n" + sgl.gen("answer", temperature=0, regex=r'- "[^"\n]+"(?:\n- "[^"\n]+")*\n```|```', stop="```", max_tokens=1024)

endpoint = sgl.RuntimeEndpoint("http://REMOTEIP:PORT")
sgl.set_default_backend(endpoint)

async def main():
    state = demo.run()

asyncio.run(main())
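The unbounded repetition in that regex may be what lets generation run away: nothing in the pattern itself caps the number of `- "..."` lines. As a standalone illustration (using Python's `re` module, not sglang's actual FSM-based decoder), documents with any number of list entries remain full matches:

```python
import re

# The YAML regex from the reproduction script above.
pattern = re.compile(r'- "[^"\n]+"(?:\n- "[^"\n]+")*\n```|```')

# The (?:...)* group is unbounded: a document with any number of
# '- "..."' lines still matches, so the pattern alone never forces
# the constrained decoder toward termination.
for n in (1, 100, 10000):
    doc = "\n".join(f'- "string{i}"' for i in range(n)) + "\n```"
    assert pattern.fullmatch(doc)

print("regex matches documents of arbitrary length")
```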

If the regex is removed, there is no problem: generation stops when the token limit is reached.

If I change the model to mistralai/Mistral-7B-Instruct-v0.2, the issue does not appear.

Other than that, meta-llama/Meta-Llama-3-8B-Instruct does work with other prompts that use the same regex.

Gintasz avatar May 08 '24 19:05 Gintasz

I've worked around the problem by replacing the YAML output format with an XML format: r"<array>\n(?:<string>.*?<\/string>\n)*<\/array>```"
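As a quick sanity check of that replacement pattern (the sample strings below are made up for illustration), Python's `re` confirms it matches output of the expected shape:

```python
import re

# The XML-based replacement regex from the comment above.
pattern = re.compile(r"<array>\n(?:<string>.*?<\/string>\n)*<\/array>```")

# A hypothetical model output in the XML format.
sample = "<array>\n<string>string1</string>\n<string>string2</string>\n</array>```"
assert pattern.fullmatch(sample)

print("XML regex matches the sample output")
```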

Gintasz avatar May 10 '24 07:05 Gintasz

I had the same problem with llama3 refusing to stop despite using the appropriate "<|eot_id|>" stop string from the llama3-instruct template. Adding "assistant" as a stop string in the call to sgl.gen seemed to abate the issue entirely. Can you give that a try with your YAML regex?

IliaZenkov avatar May 14 '24 03:05 IliaZenkov