sglang
Can SGL generate list of json?
I want to generate output in the following format, that is, a list of JSON objects:
[
  {"name": "Alice", "age": 1},
  {"name": "Bob", "age": 2},
]
The number of objects in the list varies depending on the output of the LLM. Can SGL support such a format?
Yes, you can. Just specify a regex for the JSON list. Here is a simple example:
from sglang import function, gen, set_default_backend, Runtime

JSON_ITEM_REGEX = r"{\"name\": \"\w+\", \"age\": \d+}"
JSON_LIST_REGEX = r"\[\n\s*(" + JSON_ITEM_REGEX + r",?\n\s*" + ")*" + r"\]"

@function
def regex_gen(s):
    s += "Q: Show me a example of a list of JSON items, where each item has a name and an age.\n"
    s += "Here is an example:\n"
    s += """[
{"name": "Alice", "age": 1},
{"name": "Bob", "age": 2},
]\n\n
"""
    s += "A: \n" + gen(
        "answer",
        temperature=0,
        regex=JSON_LIST_REGEX,
        max_tokens=100,
    )

runtime = Runtime(model_path="meta-llama/Llama-2-7b-chat-hf")
set_default_backend(runtime)
state = regex_gen.run()
print(state.text())
runtime.shutdown()
The resulting output is:
Q: Show me a example of a list of JSON items, where each item has a name and an age.
Here is an example:
[
{"name": "Alice", "age": 1},
{"name": "Bob", "age": 2},
]
A:
[
{"name": "Alice", "age": 1},
{"name": "Bob", "age": 2},
{"name": "Charlie", "age": 3},
]
However, letting the LLM decide how many JSON items to generate may be unreliable. By calling sgl.gen() multiple times with only the JSON item regex, you can decode exactly as many items as you want.
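A minimal sketch of that approach, assuming the same `function`/`gen` frontend used in the example above. The names `make_fixed_length_program`, `collect_items`, `item_key` and the `num_items` argument are made up for illustration, and `sglang` is imported lazily so the helpers can be checked without a model. Because the commas and brackets are appended as fixed text, only the items themselves are decoded.

```python
import json

# Same item regex as in the example above.
JSON_ITEM_REGEX = r"{\"name\": \"\w+\", \"age\": \d+}"


def item_key(i):
    # Hypothetical naming scheme for the per-item variables.
    return f"item_{i}"


def make_fixed_length_program():
    # Lazy import: sglang is only needed when the program is actually built.
    from sglang import function, gen

    @function
    def fixed_length_list(s, num_items):
        s += "A: \n[\n"
        for i in range(num_items):
            # One constrained gen() call per item; the loop fixes the count.
            s += gen(item_key(i), temperature=0, regex=JSON_ITEM_REGEX, max_tokens=32)
            # Separators are plain text, so the last item gets no trailing comma.
            s += ",\n" if i < num_items - 1 else "\n"
        s += "]"

    return fixed_length_list


def collect_items(state, num_items):
    # Each item is stored under its own name, so it can be parsed individually.
    return [json.loads(state[item_key(i)]) for i in range(num_items)]
```

With a default backend set as in the example above, this would presumably be run as `state = make_fixed_length_program().run(num_items=3)` followed by `collect_items(state, 3)`. An individual item can still be cut off if it needs more than `max_tokens` tokens.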
Guidance provides better primitives such as zero_or_more and one_or_more. Will sgl support primitives like these? See https://github.com/guidance-ai/guidance/discussions/402#discussioncomment-7863261
The zero_or_more primitive is essentially syntactic sugar. We currently support constrained decoding only with raw regexes; higher-level primitives, not limited to the ones in this issue, may be supported in the future (see #39).
Guidance implements this with a recursive select: https://github.com/guidance-ai/guidance/blob/d36601b62096311988fbba1ba15ae4126fb695df/guidance/library/_one_or_more.py#L5
For now, you can write it like this in the sglang frontend:
JSON_ITEM_REGEX = r'(\s)*{\"name\": \"\w+\", \"age\": \d+},\n'

def one_or_more_wrapper(regex_string):
    return r'(' + regex_string + r')+'

...
    s += (
        "A: \n[\n"
        + gen(
            "answer",
            temperature=0,
            regex=one_or_more_wrapper(JSON_ITEM_REGEX),
            max_tokens=100,
        )
        + "\n]"
    )
...
If you are worried that a generated JSON item could be truncated by the length limit, we may in the future add a mechanism that guarantees the integrity of simple primitives such as one_or_more, zero_or_more and ends_with.
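Until such a mechanism exists, one possible workaround is to post-process the generated text and keep only the items that match completely. The sketch below relies on Python's `re` module rather than on any sglang feature; `keep_complete_items` is a hypothetical helper, and the sample string stands in for a model output that was cut off by `max_tokens`.

```python
import re

# Item regex from the snippet above: optional whitespace, one item, then ",\n".
JSON_ITEM_REGEX = r'(\s)*{\"name\": \"\w+\", \"age\": \d+},\n'


def keep_complete_items(generated):
    # finditer only yields spans that match a whole item, so a truncated
    # tail (which lacks the closing "},\n") is left out.
    return [
        m.group(0).strip().rstrip(",")
        for m in re.finditer(JSON_ITEM_REGEX, generated)
    ]


# Simulated output whose third item was cut off by the token limit.
truncated = '{"name": "Alice", "age": 1},\n{"name": "Bob", "age": 2},\n{"name": "Char'
complete = keep_complete_items(truncated)
print(complete)
```

The surviving items no longer carry the trailing comma, so each one can be passed to `json.loads` on its own.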
Hi
I want to generate a list of JSON objects, and for every single object I need to read its values during generation. If I write the whole structure in one gen call, I cannot access those values. I would like to write something like the following:
@sgl.function
def single(s):
    s += "{"
    s += "'name': " + sgl.gen("name") + ","
    if s["name"] == "A":
        s += "'age': " + sgl.gen("age")
    else:
        s += "'phone': " + sgl.gen("phone")
    s += "}"

@sgl.function
def multiple(s):
    s += "["
    s += zero_or_more(single(s))
    s += "]"
Besides, using regular expressions has the following drawbacks:
- Regular expressions cannot save tokens during generation. For example, 'age:' in the above example doesn't need to be generated, and skipping it would save some time.
- The above regular expression produces an incorrectly formatted list of JSON objects: the last object in the list still ends with a comma.
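The trailing-comma problem can be avoided while staying within plain regexes by placing the separator between items rather than after each one. The following is a sketch checked only with Python's `re` and `json` modules, not against sglang's constrained decoder; `separated_by` is a hypothetical helper name.

```python
import json
import re

# Item regex from the first example in this thread.
JSON_ITEM_REGEX = r"{\"name\": \"\w+\", \"age\": \d+}"


def separated_by(item_regex, separator_regex):
    # item (separator item)*: one or more items, separators only between them.
    return "(" + item_regex + ")(" + separator_regex + "(" + item_regex + "))*"


JSON_LIST_REGEX = r"\[\n\s*" + separated_by(JSON_ITEM_REGEX, r",\n\s*") + r"\n\]"

with_trailing_comma = '[\n{"name": "Alice", "age": 1},\n{"name": "Bob", "age": 2},\n]'
without_trailing_comma = '[\n{"name": "Alice", "age": 1},\n{"name": "Bob", "age": 2}\n]'

# The trailing-comma form is rejected; the valid form matches and parses as JSON.
assert re.fullmatch(JSON_LIST_REGEX, with_trailing_comma) is None
assert re.fullmatch(JSON_LIST_REGEX, without_trailing_comma) is not None
items = json.loads(without_trailing_comma)
```

This pattern requires at least one item. It does not address the first drawback, since every character of the list is still decoded by the model.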
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.