Are we able to structure JSON output into a single line with just one whitespace?

Open timothylimyl opened this issue 1 year ago • 4 comments

Presentation of the new feature

Output JSON without wasting tokens on whitespace and line breaks.

Example output: {"name" : "Tim" , "age" : 25 , "interest" : "llm" }

Where does it fit in Outlines?

Structured Generation

Are you willing to open a PR?

Yes.

timothylimyl avatar May 20 '24 09:05 timothylimyl

Please pass whitespace_pattern=r'[ ]?' to outlines.generate.json()
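
To make the effect of the suggested pattern r'[ ]?' concrete, here is a minimal sketch using only Python's standard `re` module. It does not call Outlines at all: `flat_object_regex`, `STRING`, and `INTEGER` are hypothetical names invented for this illustration, and the regex Outlines builds internally from a schema may differ. The sketch only shows which strings are allowed when every gap between JSON tokens is limited to zero or one space.

```python
import json
import re

# The whitespace pattern suggested above: zero or one space between tokens.
WHITESPACE = r"[ ]?"

# Simplified value patterns (assumptions for this sketch, not the full JSON grammar).
STRING = r'"[^"\\]*"'
INTEGER = r"-?(0|[1-9][0-9]*)"


def flat_object_regex(fields, ws=WHITESPACE):
    """Hypothetical helper: regex for a flat JSON object with fixed, ordered keys."""
    members = [
        '"' + re.escape(key) + '"' + ws + ":" + ws + value
        for key, value in fields
    ]
    body = (ws + "," + ws).join(members)
    return re.compile(r"\{" + ws + body + ws + r"\}")


pattern = flat_object_regex(
    [("name", STRING), ("age", INTEGER), ("interest", STRING)]
)

record = {"name": "Tim", "age": 25, "interest": "llm"}
single_line = json.dumps(record)             # one space after ':' and ','
indented = json.dumps(record, indent=4)      # line breaks plus 4-space indentation

print(bool(pattern.fullmatch(single_line)))  # single-line form is accepted
print(bool(pattern.fullmatch(indented)))     # indented form is rejected
```

Because the pattern has no newline and no repetition, a generator constrained by it cannot emit line breaks or runs of spaces, which is what keeps the output on one line.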

lapp0 avatar May 20 '24 09:05 lapp0

I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

rlouf avatar May 20 '24 19:05 rlouf

> I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

My bike-shedding: with newlines there will typically be indentation of 8, 12, or 16 spaces. We should set the default whitespace pattern to r'[ ]?' so that the JSON output is a single line.

lapp0 avatar May 20 '24 20:05 lapp0

Fair. We can give it a try and see if we still get complaints from users.

rlouf avatar May 22 '24 08:05 rlouf

> I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

What makes you think that the JSON in training data follows a scheme of at most 4 spaces and one line break?

The reason I am asking about single-line (flat) JSON output is to save tokens. My intuition is that well-formatted JSON is meant for humans, not LLMs; an LLM can deal with a flat JSON structure in both input and output. A very long, potentially nested single-line JSON can be very difficult for a human to read, but the same is most likely not true for an LLM.
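
As a rough illustration of the saving, the sketch below serializes the example record three ways with the standard `json` module and compares lengths. Character counts are only a proxy here: the actual token savings depend on the tokenizer, which this sketch does not model.

```python
import json

record = {"name": "Tim", "age": 25, "interest": "llm"}

pretty = json.dumps(record, indent=4)                # multi-line, 4-space indentation
flat = json.dumps(record)                            # single line, one space after ':' and ','
compact = json.dumps(record, separators=(",", ":"))  # single line, no spaces at all

for label, text in [("indent=4", pretty), ("flat", flat), ("compact", compact)]:
    print(f"{label:>8}: {len(text)} characters")

# All three forms decode to the same data; only the whitespace differs.
same_data = json.loads(pretty) == json.loads(flat) == json.loads(compact)
print(same_data)
```

The gap between the indented and flat forms grows with nesting depth, since every nested line carries its own indentation.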

timothylimyl avatar May 23 '24 03:05 timothylimyl

This was addressed by #916, closing for now.

rlouf avatar May 24 '24 11:05 rlouf

@rlouf @lapp0 Thanks for making the change.

timothylimyl avatar May 29 '24 02:05 timothylimyl