Feature Suggestion - schema-based output
It would be good if the extraction output were based on a user-specified schema. Right now the output has no structure.
Yes, this was confusing for me too. In some ways you "define" a schema, but only loosely, through the flat string examples.
The class_name field in the examples makes it seem like a schema data class could fit in there. I had to check the source code to see if I had misunderstood.
Is there a reason simple strings were chosen (maybe simplicity for models)? If not, is this something the team is open to supporting: an API that allows examples to be defined by a schema class, with an instance of it providing the example data?
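To make the question concrete, here is a rough sketch of what "a schema class plus an instance as the example" could look like. Everything in it is hypothetical: `Medication`, `Route`, and `example_to_flat` are illustrative names, not part of LX, and the flat dict shape is only an assumption about what an example might reduce to.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class Route(str, Enum):
    """Hypothetical enum restricting one attribute to known values."""
    ORAL = "oral"
    IV = "intravenous"


@dataclass
class Medication:
    """Hypothetical schema class; not part of the library."""
    text: str
    dosage: str
    route: Route


def example_to_flat(instance) -> dict:
    """Flatten a schema instance into a class name plus string attributes."""
    data = asdict(instance)
    text = data.pop("text")
    attributes = {
        key: (value.value if isinstance(value, Enum) else str(value))
        for key, value in data.items()
    }
    return {
        "class_name": type(instance).__name__,
        "text": text,
        "attributes": attributes,
    }


# An instance of the schema class doubles as the example data.
example = example_to_flat(
    Medication(text="250 mg amoxicillin", dosage="250 mg", route=Route.ORAL)
)
```

The idea is that the typed class carries the structure, while the flattened form stays compatible with string-based examples.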
Adding my support to this suggestion! It would be really great to be able to specify an Enum of possible values for each attribute.
Hey @abishekchiffon,
This is a great idea and discussion. Ultimately, LX goes from user-defined prompt examples to the schema that is used for Controlled Generation. There could be multiple strategies for how this mapping is performed, or even for overriding the automatic schema that LX builds from the example.
With the recent refactor, I think it will be clean to support different strategies for building the final schema the model uses for a given prompt, ranging from flexible (for example, “I just want valid JSON”) to very specific (for example, “I only want these exact key-value pairs”). Note that developers can now also implement their own custom schema handling, per #99.
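As a rough illustration of the two ends of that range, the sketch below derives a JSON-Schema-style dict from one example's attributes. It is not how LX builds its schema; `build_schema` and its `strict` flag are hypothetical names, and treating every value as a string is an assumption made to keep the example short.

```python
def build_schema(example_attributes: dict, strict: bool) -> dict:
    """Derive a JSON-Schema-style dict from one example's attributes."""
    if not strict:
        # Flexible end: any JSON object is acceptable.
        return {"type": "object"}
    # Specific end: exactly the keys seen in the example, all strings.
    return {
        "type": "object",
        "properties": {key: {"type": "string"} for key in example_attributes},
        "required": sorted(example_attributes),
        "additionalProperties": False,
    }


example_attributes = {"dosage": "250 mg", "route": "oral"}
flexible = build_schema(example_attributes, strict=False)
specific = build_schema(example_attributes, strict=True)
```

Intermediate strategies would sit between these two, for instance requiring the keys but leaving `additionalProperties` open.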
I personally favor keeping things more flexible with the schema, since models are getting better at following prompt-based specifications. If you over-constrain the output, you can end up “fighting” the model’s generated tokens, which sometimes leads to edge cases such as this infinite decoding loop.
Is there any workaround for this? I would love to have structured output somehow.
Yeah, the enum is a necessity here. I'm getting a lot of failing examples where the model doesn't follow the prompt... but I know that if I provided a defined schema, it would.
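Until something like this is supported, one possible stopgap is to validate attribute values after extraction and retry or discard the ones that fall outside the allowed set. The sketch below assumes the results have already been converted to plain dicts with an "attributes" mapping; that shape, along with `Severity` and `split_by_enum`, is hypothetical and not an LX API.

```python
from enum import Enum


class Severity(str, Enum):
    """Hypothetical enum of the only values we want to accept."""
    MILD = "mild"
    MODERATE = "moderate"
    SEVERE = "severe"


def split_by_enum(extractions, attribute, enum_cls):
    """Separate extractions whose attribute value is an allowed enum value."""
    allowed = {member.value for member in enum_cls}
    valid, invalid = [], []
    for item in extractions:
        value = item.get("attributes", {}).get(attribute)
        (valid if value in allowed else invalid).append(item)
    return valid, invalid


# Assumed shape of post-processed results; not the library's own objects.
extractions = [
    {"text": "slight headache", "attributes": {"severity": "mild"}},
    {"text": "bad cough", "attributes": {"severity": "pretty bad"}},
]
valid, invalid = split_by_enum(extractions, "severity", Severity)
```

This does not constrain decoding the way a real schema would; it only catches the failures afterwards so they can be re-run or dropped.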