examples : generate JSON according to schema
Depends on #1773 (to test this, merge that first)
Adds a Python script that converts a JSON schema into the grammar format from #1773. This allows generating JSON according to a schema, like Jsonformer or OpenAI's function calling.
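To give a feel for the conversion without reading the whole script, here is a rough, simplified sketch of the idea. This is an illustration only; the real examples/json-schema-to-grammar.py covers much more of the spec (string escaping, oneOf/const, --prop-order, shared whitespace rules):

# Rough illustration only -- not the actual examples/json-schema-to-grammar.py.
import json

PRIMITIVES = {
    "string": '"\\"" [^"]* "\\""',
    "number": '"-"? [0-9]+ ("." [0-9]+)?',
    "boolean": '("true" | "false")',
}

def schema_to_rules(schema, name="root", rules=None):
    """Map a tiny subset of JSON schema onto grammar rules (rule name -> body)."""
    rules = {} if rules is None else rules
    t = schema.get("type")
    if t == "object":
        parts = ['"{"']
        for i, (prop, sub) in enumerate(schema.get("properties", {}).items()):
            sub_name = f"{name}-{prop}"
            schema_to_rules(sub, sub_name, rules)
            if i > 0:
                parts.append('","')
            parts.append(f'"\\"{prop}\\":"')   # literal key, e.g. "name":
            parts.append(sub_name)
        parts.append('"}"')
        rules[name] = " ".join(parts)
    elif t == "array":
        item = f"{name}-item"
        schema_to_rules(schema.get("items", {}), item, rules)
        rules[name] = f'"[" ({item} ("," {item})*)? "]"'
    else:
        rules[name] = PRIMITIVES.get(t, PRIMITIVES["string"])
    return rules

if __name__ == "__main__":
    with open("../schemas/student.json") as f:
        rules = schema_to_rules(json.load(f))
    for rule_name, body in rules.items():
        print(f"{rule_name} ::= {body}")

Each schema node becomes a grammar rule, and an object schema becomes a rule that spells out its properties in order, which is also what makes a --prop-order option possible.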
Examples
Jsonformer Student Example
% cat ../schemas/student.json
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "number"},
    "is_student": {"type": "boolean"},
    "courses": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/student.json --prop-order 'is_student,name,age' )" -p 'Hermione Granger '
main: build = 694 (e8259e4)
main: seed = 1686892597
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
main: grammar:
<0>space_1 ::= <2>[ - ] |
<9>space ::= <11>space_1
...
Hermione Granger { "is_student" : true, "name" : "Hermione", "age" :12, "courses" : [ "muggle studies","history of magic" , "charms","potion" ]} [end of text]
llama_print_timings: load time = 396.96 ms
llama_print_timings: sample time = 55.45 ms / 57 runs ( 0.97 ms per token)
llama_print_timings: prompt eval time = 347.81 ms / 6 tokens ( 57.97 ms per token)
llama_print_timings: eval time = 3898.12 ms / 56 runs ( 69.61 ms per token)
llama_print_timings: total time = 4306.70 ms
Jsonformer car example
% cat ../schemas/car.json
{"type": "object", "properties": {"car": {"type": "object", "properties": {"make": {"type": "string"}, "model": {"type": "string"}, "year": {"type": "number"}, "colors": {"type": "array", "items": {"type": "string"}}, "features": {"type": "object", "properties": {"audio": {"type": "object", "properties": {"brand": {"type": "string"}, "speakers": {"type": "number"}, "hasBluetooth": {"type": "boolean"}}}, "safety": {"type": "object", "properties": {"airbags": {"type": "number"}, "parkingSensors": {"type": "boolean"}, "laneAssist": {"type": "boolean"}}}, "performance": {"type": "object", "properties": {"engine": {"type": "string"}, "horsepower": {"type": "number"}, "topSpeed": {"type": "number"}}}}}}}, "owner": {"type": "object", "properties": {"firstName": {"type": "string"}, "lastName": {"type": "string"}, "age": {"type": "number"}}}}}
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/car.json --prop-order 'car,make,model,owner,firstName,lastName,age,year' )" -p 'Brought the 97 Civic in '
main: build = 694 (e8259e4)
main: seed = 1686892847
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
main: grammar:
<0>space_1 ::= <2>[ - ] |
<9>space ::= <11>space_1
...
Brought the 97 Civic in { "car" : { "make" : "Honda", "model" : "Civic", "year" :1997, "colors": [ "Black","Silver","Gray"] , "features":{ "audio": {"brand": "Bose", "hasBluetooth": false, "speakers":10}, "performance":{"engine": "K20A2", "horsepower":230,"topSpeed":185},"safety": {"airbags":10, "laneAssist":false,"parkingSensors":false}} } , "owner" : { "firstName":"Brian","lastName":"O'Donnell" , "age":32} } [end of text]
llama_print_timings: load time = 324.46 ms
llama_print_timings: sample time = 196.27 ms / 182 runs ( 1.08 ms per token)
llama_print_timings: prompt eval time = 707.57 ms / 12 tokens ( 58.96 ms per token)
llama_print_timings: eval time = 12594.43 ms / 181 runs ( 69.58 ms per token)
llama_print_timings: total time = 13515.57 ms
OpenAI-style function calling
% cat ../schemas/functions.json
{
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "function": {"const": "create_event"},
        "arguments": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "date": {"type": "string"},
            "time": {"type": "string"}
          }
        }
      }
    },
    {
      "type": "object",
      "properties": {
        "function": {"const": "search"},
        "arguments": {
          "type": "object",
          "properties": {
            "query": {"type": "string"}
          }
        }
      }
    }
  ]
}
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/functions.json --prop-order 'function,arguments' )" -p $'Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"\n\nRequest: Call mom at 5pm \nFunction Call: '
main: build = 694 (e8259e4)
main: seed = 1686893039
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
main: grammar:
<0>space_1 ::= <2>[ - ] |
<9>space ::= <11>space_1
<15>0-function ::= ..
Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"
Request: Call mom at 5pm
Function Call: {"function":"create_event","arguments":{"date":"2017-11-16T18:00:00+00:00","time":"17:00" , "title":"Call my mom" }} [end of text]
llama_print_timings: load time = 302.69 ms
llama_print_timings: sample time = 63.82 ms / 63 runs ( 1.01 ms per token)
llama_print_timings: prompt eval time = 3517.46 ms / 42 tokens ( 83.75 ms per token)
llama_print_timings: eval time = 4388.51 ms / 62 runs ( 70.78 ms per token)
llama_print_timings: total time = 7975.77 ms
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/functions.json --prop-order 'function,arguments' )" -p $'Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"\n\nRequest: What meetings are happening this afternoon? \nFunction Call: '
main: build = 694 (e8259e4)
...
Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"
Request: What meetings are happening this afternoon?
Function Call: { "function": "search", "arguments": { "query": "what meetings are happening today?" } } [end of text]
llama_print_timings: load time = 300.87 ms
llama_print_timings: sample time = 30.92 ms / 32 runs ( 0.97 ms per token)
llama_print_timings: prompt eval time = 3535.50 ms / 44 tokens ( 80.35 ms per token)
llama_print_timings: eval time = 2114.93 ms / 31 runs ( 68.22 ms per token)
llama_print_timings: total time = 5684.63 ms
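Because the grammar constrains the output to one of the two shapes in the schema, consuming it on the application side stays simple. A minimal sketch of that usage (the create_event/search handlers below are placeholders, not part of this PR):

import json

# Placeholder handlers -- stand-ins for whatever the application actually exposes.
def create_event(title=None, date=None, time=None):
    print(f"create_event: {title!r} on {date} at {time}")

def search(query=None):
    print(f"search: {query!r}")

HANDLERS = {"create_event": create_event, "search": search}

def dispatch(raw):
    """Parse the grammar-constrained model output and call the named function."""
    call = json.loads(raw)  # should always parse, since the grammar enforces the shape
    return HANDLERS[call["function"]](**call.get("arguments", {}))

dispatch('{"function":"search","arguments":{"query":"meetings this afternoon"}}')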
Can we do this logic in C++ so that we can support this in main?
Well, this is usable with main (as in the examples) as an input to --grammar. In general, I think it would be more complex to do in C++. And the lack of built-in JSON would be a challenge (I believe C++ examples here using the JSON library have to be left out of make)
Agreed that C++ may be complex, but JSON is already used in the server example; check examples/server/json.hpp.
Yeah, I thought based on the discussion that the JSON dependency meant that server had to be CMake-only and excluded from the Makefile. It does look like it's in the Makefile, although hidden behind a flag, so I might be wrong.
That said, there's still the complexity point. Do you feel that JSON schema support directly in main (vs a preprocessor to main) is sufficiently valuable to warrant the extra effort?
My main use case today is using an LLM as a backend for functions. In that scenario, I always prefer the LLM to return a valid JSON string so the result is easy to parse, so JSON schema support in main would be very useful.
@slaren or @SlyEcho either of you interested in reviewing this?
It doesn't seem to match the grammar on json.org; for example, the root can also be an array, a string, or any other value. I'm not sure whether it's possible to transform that grammar into ours; the string escaping is probably the hardest part.
There are two separate grammars here: grammars/json.gbnf is a standalone sample grammar, while examples/json-schema-to-grammar.py stitches a grammar together dynamically based on a schema. I just opted to update the generic JSON grammar in conjunction with this script to bring it up to spec (more on that below).
As for the root type: in grammars/json.gbnf, I tried setting root ::= value, but without any context, the model was likely to just produce e.g., a number and quit. Restricting it to an object seemed to give the best (most interesting) outcome for testing out JSON generation in the general case. For the schema-driven script, I've just pushed a fix to ensure that you can in fact generate from a schema denoting a primitive value, if that is of use to anyone.
Regarding the JSON spec, for this iteration I carefully followed the syntax on json.org for numbers and strings so it should in fact be compliant. The escaping is indeed there now:
string ::=
  "\"" (
    [^"\\] |
    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
  )* "\"" ws
@ggerganov any interest in giving this a quick look?
I tried the example from https://json-schema.org/learn/getting-started-step-by-step.html and the converter had issues with the description properties and the integer type.
But it works otherwise 👍
Thanks!
@SlyEcho I added support for integer, so that tutorial now runs up to the point where they split up the schemas:
% ./main -m $LLAMA2_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/getting-started-full.json --prop-order 'productName,price,productId,dimensions' )"
...
{"productName":"Blu-ray+DVD: The Good Dinosaur","price":10,"productId":452389,"dimensions":{"height":267,"length":152.4,"width":178},"tags":["Blu-ray","Comedy","Drama","Kids \u0026 Family","Sci-Fi \u0026 Fantasy"]} [end of text]
llama_print_timings: load time = 351.81 ms
llama_print_timings: sample time = 391.91 ms / 103 runs ( 3.80 ms per token, 262.81 tokens per second)
llama_print_timings: prompt eval time = 114.75 ms / 2 tokens ( 57.38 ms per token, 17.43 tokens per second)
llama_print_timings: eval time = 6675.73 ms / 102 runs ( 65.45 ms per token, 15.28 tokens per second)
llama_print_timings: total time = 7242.08 ms