Can SGL generate list of json?

Open CSWellesSun opened this issue 1 year ago • 6 comments

I want to generate output in the following format, a list of JSON objects: [ {"name": "Alice", "age": 1}, {"name": "Bob", "age": 2}, ] The number of objects in the list is not fixed; it depends on the LLM's output. Can SGL support this format?

CSWellesSun avatar Jan 19 '24 08:01 CSWellesSun

Yes, you can. Just specify a regex for the JSON list. Here is a simple example:

from sglang import function, gen, set_default_backend, Runtime


JSON_ITEM_REGEX = r"{\"name\": \"\w+\", \"age\": \d+}"

JSON_LIST_REGEX = r"\[\n\s*(" + JSON_ITEM_REGEX + r",?\n\s*" + ")*" + r"\]"


@function
def regex_gen(s):
    s += "Q: Show me a example of a list of JSON items, where each item has a name and an age.\n"
    s += "Here is an example:\n"
    s += """[
    {"name": "Alice", "age": 1},
    {"name": "Bob", "age": 2},
]\n\n
"""
    s += "A: \n" + gen(
        "answer",
        temperature=0,
        regex=JSON_LIST_REGEX,
        max_tokens=100,
    )


runtime = Runtime(model_path="meta-llama/Llama-2-7b-chat-hf")
set_default_backend(runtime)

state = regex_gen.run()

print(state.text())

runtime.shutdown()

And the result output is:

Q: Show me a example of a list of JSON items, where each item has a name and an age.
Here is an example:
[
    {"name": "Alice", "age": 1},
    {"name": "Bob", "age": 2},
]


A: 
[
    {"name": "Alice", "age": 1},
    {"name": "Bob", "age": 2},
    {"name": "Charlie", "age": 3},
]
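
The pattern can be sanity-checked locally with Python's re module before it is passed to the backend. This is only a sketch: it assumes the constrained-decoding engine interprets the pattern the same way re does, which may not hold for every construct. The sample string is copied from the output above.

```python
import re

# Same patterns as in the example above.
JSON_ITEM_REGEX = r"{\"name\": \"\w+\", \"age\": \d+}"
JSON_LIST_REGEX = r"\[\n\s*(" + JSON_ITEM_REGEX + r",?\n\s*" + ")*" + r"\]"

# The generated answer shown above, written out as a Python string.
sample = (
    "[\n"
    '    {"name": "Alice", "age": 1},\n'
    '    {"name": "Bob", "age": 2},\n'
    '    {"name": "Charlie", "age": 3},\n'
    "]"
)

list_pattern = re.compile(JSON_LIST_REGEX)

# The whole answer must match, not just a prefix of it.
matches_sample = list_pattern.fullmatch(sample) is not None

# The trailing ")*" also allows a list with no items at all.
matches_empty = list_pattern.fullmatch("[\n]") is not None

# Pull the individual items back out of the answer.
items = re.findall(JSON_ITEM_REGEX, sample)

print(matches_sample, matches_empty, len(items))
```

Note that this pattern accepts an empty list, so the model is free to stop immediately.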

hnyls2002 avatar Jan 19 '24 08:01 hnyls2002

However, letting the LLM control the number of generated JSON items may be unreliable. By calling sgl.gen() multiple times with only the JSON item regex, you can decode exactly as many items as you want.
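
One way to read this suggestion, as a sketch only: drive the loop from Python and write the brackets and commas yourself, so that each constrained call produces exactly one item. The generate_item callable below is a hypothetical stand-in for a single sgl.gen(..., regex=JSON_ITEM_REGEX) call; it is not part of sglang, and fake_generate_item exists only so the sketch runs without a model.

```python
import json
import re
from typing import Callable, List

JSON_ITEM_REGEX = r"{\"name\": \"\w+\", \"age\": \d+}"


def build_json_list(generate_item: Callable[[int], str], count: int) -> str:
    """Assemble a JSON list from `count` individually generated items.

    `generate_item` is a hypothetical placeholder for one constrained
    generation call that returns a single item as text.
    """
    items: List[str] = []
    for index in range(count):
        item = generate_item(index)
        # Guard against an item that does not follow the item pattern.
        if re.fullmatch(JSON_ITEM_REGEX, item) is None:
            raise ValueError(f"item {index} does not match: {item!r}")
        items.append(item)
    # The caller, not the model, writes the brackets and separators,
    # so there is never a trailing comma.
    return "[" + ", ".join(items) + "]"


def fake_generate_item(index: int) -> str:
    # Stand-in for the model, used only to exercise the loop.
    names = ["Alice", "Bob", "Charlie"]
    return json.dumps({"name": names[index % len(names)], "age": index + 1})


result = build_json_list(fake_generate_item, 3)
parsed = json.loads(result)
print(parsed)
```

Because the item count is a plain Python argument, the list length no longer depends on when the model decides to close the bracket.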

hnyls2002 avatar Jan 19 '24 08:01 hnyls2002

Guidance provides better primitives such as zero_or_more and one_or_more. Will sgl support primitives like these? See: https://github.com/guidance-ai/guidance/discussions/402#discussioncomment-7863261

CSWellesSun avatar Jan 19 '24 09:01 CSWellesSun

The zero_or_more primitive is mostly syntactic sugar. We currently support constrained decoding only with raw regexes; higher-level primitives, including but not limited to the ones in this issue, may be supported in the future (see #39).

Guidance implements this with a recursive select: https://github.com/guidance-ai/guidance/blob/d36601b62096311988fbba1ba15ae4126fb695df/guidance/library/_one_or_more.py#L5

For now, you can write it like this in the sglang frontend:

JSON_ITEM_REGEX = r'(\s)*{\"name\": \"\w+\", \"age\": \d+},\n'

def one_or_more_wrapper(regex_string):
    return r'(' + regex_string + r')+'

...
    s += (
        "A: \n[\n"
        + gen(
            "answer",
            temperature=0,
            regex=one_or_more_wrapper(JSON_ITEM_REGEX),
            max_tokens=100,
        )
        + "\n]"
    )
...

If you are worried that the last generated JSON item could be truncated by the length limit, we may add a mechanism in the future that guarantees integrity for simple primitives such as one_or_more, zero_or_more and ends_with.
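
Until such a mechanism exists, a truncated answer can be cleaned up after the fact. The sketch below is a workaround, not an sglang feature: the complete_items helper is hypothetical, and it assumes the item pattern without the trailing ",\n" separator so that only whole items are kept.

```python
import json
import re

# Item pattern without the ",\n" separator used in the snippet above.
JSON_ITEM_REGEX = r'{\"name\": \"\w+\", \"age\": \d+}'


def complete_items(raw_text):
    """Return only the items that were fully generated.

    An item cut off by max_tokens has no closing brace, so it does not
    match the item pattern and is dropped.
    """
    return [json.loads(match) for match in re.findall(JSON_ITEM_REGEX, raw_text)]


# Example of an answer that hit the length limit in the middle of an item.
truncated = (
    '    {"name": "Alice", "age": 1},\n'
    '    {"name": "Bob", "age": 2},\n'
    '    {"name": "Char'
)

items = complete_items(truncated)
repaired = json.dumps(items)
print(repaired)
```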

hnyls2002 avatar Jan 19 '24 10:01 hnyls2002

Hi

Vvvvhvh avatar Jan 19 '24 21:01 Vvvvhvh

I want to generate a list of JSON objects and, for every single object, read its values during generation. If I write the whole structure in one gen call, I cannot access those values. I would like to write something like the following structure:

@sgl.function
def single(s):
    s += "{"
    s += "'name': " + sgl.gen("name") + ","
    if s["name"] == "A":
        s += "'age': " + sgl.gen("age")
    else:
        s += "'phone': " + sgl.gen("phone")
    s += "}"

@sgl.function
def multiple(s):
    s += "["
    s += zero_or_more(single(s))
    s += "]"

Besides, using a regular expression has the following drawbacks:

  1. A regular expression cannot save tokens during generation. For example, the fixed text 'age:' in the example above does not need to be generated by the model; skipping it would save time.
  2. The regular expression above produces a malformed list of JSON objects: the last object in the list still ends with a comma, which is invalid JSON.
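
For the second point, a raw regex can still avoid the trailing comma if the separator is attached to the front of every item after the first. The sketch below checks this with Python's re and json modules; the one_or_more_separated helper is hypothetical, and whether the sglang backend accepts this exact pattern has not been verified here.

```python
import json
import re

JSON_ITEM_REGEX = r'{\"name\": \"\w+\", \"age\": \d+}'


def one_or_more_separated(item_regex, separator_regex=r",\n\s*"):
    # item (separator item)* : a separator can only appear between two items.
    return "(" + item_regex + ")(" + separator_regex + "(" + item_regex + "))*"


JSON_LIST_REGEX = r"\[\n\s*" + one_or_more_separated(JSON_ITEM_REGEX) + r"\n\]"

well_formed = '[\n    {"name": "Alice", "age": 1},\n    {"name": "Bob", "age": 2}\n]'
trailing_comma = '[\n    {"name": "Alice", "age": 1},\n    {"name": "Bob", "age": 2},\n]'

# The list without a trailing comma matches and parses as JSON.
accepts_well_formed = re.fullmatch(JSON_LIST_REGEX, well_formed) is not None
parsed = json.loads(well_formed)

# The list with a trailing comma is rejected by the pattern.
accepts_trailing_comma = re.fullmatch(JSON_LIST_REGEX, trailing_comma) is not None

print(accepts_well_formed, accepts_trailing_comma, parsed)
```

This does not address the first point: the fixed keys are still part of the pattern the model has to decode.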

CSWellesSun avatar Jan 23 '24 07:01 CSWellesSun

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Jul 25 '24 06:07 github-actions[bot]