Are we able to structure JSON output into a single line with just one whitespace?
Presentation of the new feature
Output JSON without wasting tokens on whitespace and line breaks.
Example output: {"name": "Tim", "age": 25, "interest": "llm"}
Where does it fit in Outlines?
Structured Generation
Are you willing to open a PR?
Yes.
Please pass whitespace_pattern=r'[ ]?' to outlines.generate.json()
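To make the effect of the r'[ ]?' pattern concrete, here is a minimal sketch using only Python's standard re module. It does not call Outlines and is not how Outlines builds its regex internally; the one-field schema and the helper name is_allowed are hypothetical, chosen only to show which strings such a whitespace pattern admits.

```python
import re

# Illustration only: a hand-written regex, not the one Outlines generates.
WS = r"[ ]?"  # the suggested pattern: zero or one space between tokens

# Flat object with a single string field "name" (hypothetical schema).
flat_object = re.compile(
    r"\{" + WS + r'"name"' + WS + ":" + WS + r'"[^"]*"' + WS + r"\}"
)

def is_allowed(text: str) -> bool:
    """Return True if the whole string fits the single-line pattern."""
    return flat_object.fullmatch(text) is not None
```

Under this sketch, {"name": "Tim"} and {"name":"Tim"} are accepted, while the same object with a line break, indentation, or two consecutive spaces is rejected.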
I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.
My bike-shedding: typically with newlines there will be indentation involving 8, 12, or 16 spaces. We should set the default whitespace pattern to r'[ ]?' so that the JSON output is a single line.
Fair. We can give it a try and see if we still get complaints from users.
What makes you think that JSON in the training data follows the scheme of 4 whitespaces and one line break?
The reason I am asking about single-line (flat) JSON output is to save tokens. My intuition is that well-formatted JSON is meant for humans, not LLMs; an LLM can deal with a flat JSON structure in both input and output. A single-line JSON document that is very long and potentially nested can be very difficult for a human to read, but the same is most probably not true for an LLM.
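As a rough illustration of the savings, the sketch below compares the same record serialized with indentation and as a single line, using only the standard json module. Character count is used as a stand-in for token count, which is an assumption: the real saving depends on the tokenizer.

```python
import json

record = {"name": "Tim", "age": 25, "interest": "llm"}

# Human-oriented formatting: 4-space indentation and one field per line.
pretty = json.dumps(record, indent=4)

# Single line with at most one space after each separator (json.dumps default).
flat = json.dumps(record)

# Single line with no whitespace at all.
compact = json.dumps(record, separators=(",", ":"))

for label, text in [("pretty", pretty), ("flat", flat), ("compact", compact)]:
    print(f"{label:8s} {len(text):3d} chars")
```

All three strings parse back to the same object, so the extra characters in the indented form carry no information for the consumer.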
This was addressed by #916, closing for now.
@rlouf @lapp0 Thanks for making the change.