Paper: Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
https://arxiv.org/pdf/2408.02442
Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs’ abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs’ performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
Would really love to hear your opinion on this @jxnl.
Would be interested in your reaction @filimoa.
BTW: I am a big believer in and user of structured output since the early days :-)
Perhaps this is better suited to the discussions tab
That being said, it's an interesting paper. I wonder if we can get the best of both worlds by allowing more freeform reasoning followed by highly structured output, or by interspersing the two in multiturn conversations.
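One way to try that idea, as a rough sketch using instructor and the OpenAI client (the model name, question, and schema here are my own illustrations, not from the paper):

```python
# Two-turn flow: free-form reasoning first, structured output second.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class FinalAnswer(BaseModel):
    answer: int

client = instructor.from_openai(OpenAI())
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

# Turn 1: no response_model, so the model can reason without format constraints.
reasoning = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=None,
    messages=[{"role": "user", "content": f"Think step by step: {question}"}],
)

# Turn 2: feed the reasoning back and only now constrain the format.
result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=FinalAnswer,
    messages=[
        {"role": "user", "content": f"Think step by step: {question}"},
        {"role": "assistant", "content": reasoning.choices[0].message.content},
        {"role": "user", "content": "Now return just the final answer."},
    ],
)
print(result.answer)  # e.g. 80
```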
Related to https://github.com/jxnl/instructor/issues/888#issuecomment-2260184846 and https://github.com/jxnl/instructor/discussions/580#discussioncomment-9810506
We released a blog post on this: https://python.useinstructor.com/blog/2024/09/26/bad-schemas-could-break-your-llm-structured-outputs/
TL;DR: Structured Outputs are much more reliable in terms of reasoning performance. JSON mode exhibits much more variability (roughly 1.5x more) on the GSM8K test set we used. So depending on your use case, it's best to have an eval set and determine what works best for you.
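For instance, something like this minimal harness (the question, gold answer, and model name are placeholders, not our actual setup):

```python
# Compare tool calling vs JSON mode on a small eval set.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Response(BaseModel):
    chain_of_thought: str
    answer: int

eval_set = [
    ("Natalia sold clips to 48 of her friends in April, and then she sold "
     "half as many clips in May. How many clips did she sell altogether?", 72),
]

for mode in (instructor.Mode.TOOLS, instructor.Mode.JSON):
    client = instructor.from_openai(OpenAI(), mode=mode)
    correct = sum(
        client.chat.completions.create(
            model="gpt-4o-mini",
            response_model=Response,
            messages=[{"role": "user", "content": q}],
        ).answer == gold
        for q, gold in eval_set
    )
    print(mode, correct / len(eval_set))
```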
Closing this for now.
Nice study.
It would have been very nice to include "plain-text output" as part of the experiments, although I understand that wouldn't fit well with the title "Bad Schemas could break your LLM Structured Outputs".
Even if the output is plain text, you can still get structured output out of it. You just need to include something like this at the end of the prompt:
# Output and Format:
reasoning: str # Reasoning behind your choice of answer
answer: int # The answer to the proposed problem
And then you feed the answer from the LLM back to the LLM (in a new API call) with the required output format (JSON mode or structured outputs) so you can process the output automatically; a sketch of that second call follows.
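In code, that second call could look something like this (a sketch only; the client setup and names are assumptions on my part):

```python
# Second API call: turn the free-form reply into the format declared above.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Output(BaseModel):
    reasoning: str  # Reasoning behind the choice of answer
    answer: int     # The answer to the proposed problem

client = instructor.from_openai(OpenAI())

def to_structured(free_form_reply: str) -> Output:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Output,
        messages=[{
            "role": "user",
            "content": "Extract the reasoning and the final answer from this reply:\n\n"
                       + free_form_reply,
        }],
    )
```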
After all, that is part of what "Let Me Speak Freely" is about.
@ivanleomk very nice post - thanks a lot! Quoting from "Let Me Speak Freely?":
Upon inspection, we found that 100% of GPT 3.5 Turbo JSON-mode responses placed the "answer" key before the "reason" key, resulting in zero-shot direct answering instead of zero-shot chain-of-thought reasoning.
Could this randomness in key ordering in JSON mode at least in part explain the variability you are observing? Did you look at the order of the keys returned in JSON mode?
Related question: do you (or anyone else here) know if tool calling always adheres to the order of parameters, or is there a chance of reordering too? I tend to think it does, since an explicit JSON schema is being filled, but maybe somebody here knows for sure.
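For context, on the schema side the order is at least well-defined: Pydantic emits properties in field declaration order, so a reasoning-first model presents "reason" before "answer" in the schema the API sees (a quick check, not from the paper):

```python
# Pydantic preserves field declaration order in the generated JSON schema,
# so declaring the reasoning field first nudges CoT-before-answer.
from pydantic import BaseModel

class Response(BaseModel):
    reason: str  # declared first, so it appears first in the schema
    answer: int

print(list(Response.model_json_schema()["properties"]))
# -> ['reason', 'answer']
```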
I didn't look at the order of keys returned in JSON mode, but anecdotally I find JSON mode a bit more unstable and unreliable. Across multiple runs on different days, the same prompt and schema produced varying performance, whereas function calling remained stable.
W.r.t. function calling, you can verify this by looking at a streamed function call: it always returns the fields in the order you specify them.
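For example, with instructor's partial streaming (a sketch; the schema and model name are my guess at what produced the output below):

```python
# Stream partial objects from a function call; fields arrive in declaration order.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Person(BaseModel):
    chain_of_thought: str
    name: str
    age: int

client = instructor.from_openai(OpenAI())

stream = client.chat.completions.create_partial(
    model="gpt-4o-mini",
    response_model=Person,
    messages=[{"role": "user", "content": "Ivan is 27 and lives in Singapore."}],
)
for partial in stream:
    print(partial)
```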
chain_of_thought=None name=None age=None
chain_of_thought=None name=None age=None
chain_of_thought=None name=None age=None
chain_of_thought=None name=None age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name=None age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name=None age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name=None age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name=None age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name='Ivan' age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name='Ivan' age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name='Ivan' age=None
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name='Ivan' age=27
chain_of_thought='Ivan is 27 years old, which is a key piece of information about him. He lives in Singapore, indicating his location. The task is to extract relevant details into a structured format, which includes his name, age, and a brief context about him.' name='Ivan' age=27