Feature Suggestion - schema based output

Open abishekchiffon opened this issue 4 months ago • 5 comments

It would be good if the extraction output could be based on a user-specified schema. Right now the output has no enforced structure.

abishekchiffon, Aug 12 '25 20:08

yes this was confusing for me. in some ways you "define" a schema but loosely through the flat string examples.

the class_name field in the examples makes it seem like a schema data class could be fit in there. i had to check the source code to see if i had misunderstood.

is there a reason that simple strings were chosen (maybe simplicity for models)? if not, is this something the team is open to supporting? an api that allows examples to be defined by a schema class and an instance of it to provide example data?

the-vampiire, Aug 13 '25 14:08
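
To make the proposal above concrete, here is a minimal sketch of how a schema class and an instance of it could be flattened into the kind of string-based example discussed in this thread. Everything in it is hypothetical: `Symptom`, `Severity`, and `to_flat_example` are illustration names and not part of LangExtract, and the `class_name` key only mirrors the field name mentioned in the comment above.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Severity(Enum):
    """Hypothetical closed set of values for one attribute."""
    MILD = "mild"
    SEVERE = "severe"


@dataclass
class Symptom:
    """Hypothetical schema class describing one extraction type."""
    text: str
    severity: Severity


def to_flat_example(instance) -> dict:
    """Flatten a schema instance into a class name plus string attributes."""
    attributes = {}
    for f in fields(instance):
        value = getattr(instance, f.name)
        # Enum members are stored by their string value; everything else via str().
        attributes[f.name] = value.value if isinstance(value, Enum) else str(value)
    return {"class_name": type(instance).__name__, "attributes": attributes}


example = to_flat_example(Symptom(text="headache", severity=Severity.MILD))
print(example)
```

The schema class stays the single source of truth for attribute names and allowed values, while the flattened dict is what would be passed on as example data.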

Adding my support to this suggestion! It would be really great to be able to specify an Enum of possible values for each attribute.

LuisBrazMelo, Aug 17 '25 06:08

Hey @abishekchiffon,

This is a great idea and discussion. Ultimately, LX goes from user-defined prompt examples to the schema that is used for Controlled Generation. There could be multiple strategies for how this mapping is performed, or even for overriding the automatic schema that LX builds from the example.

With the recent refactor, I think it will be clean to support different strategies for building the final schema the model uses for a given prompt, ranging from flexible (for example, “I just want valid JSON”) to very specific (for example, “I only want these exact key-value pairs”). Note, developers can also implement their own custom schema handling now per #99.

I personally favor keeping the schema more flexible, since models are getting better at following prompt-based specifications. If you over-constrain the output, you can end up “fighting” the model’s generated tokens, which sometimes leads to edge cases such as this infinite decoding loop.

aksg87, Aug 25 '25 01:08

Any workaround for this? I would love to have structured output somehow

irux, Oct 07 '25 02:10

Yeah the enum is a necessity here, i'm getting a lot of failing examples where the model doesn't follow the prompt... but I know if I provided a defined schema it would.

eakertFacet, Dec 01 '25 20:12
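
Until enum constraints are supported, one possible stopgap is to validate extracted attributes after the fact and retry or discard the failures. The sketch below is an assumption-laden illustration, not a LangExtract feature: it assumes each extraction's attributes are available as a plain dict of strings, and `Severity`, `SCHEMA`, and `validate_attributes` are hypothetical names.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical closed set of values for the 'severity' attribute."""
    MILD = "mild"
    SEVERE = "severe"


# Hypothetical schema: attribute name -> Enum listing its allowed values.
SCHEMA = {"severity": Severity}


def validate_attributes(attributes: dict, schema: dict) -> list:
    """Return a list of error messages; an empty list means the attributes conform."""
    errors = []
    for name, enum_cls in schema.items():
        if name not in attributes:
            errors.append(f"missing attribute: {name}")
            continue
        allowed = {member.value for member in enum_cls}
        if attributes[name] not in allowed:
            errors.append(f"{name}={attributes[name]!r} not in {sorted(allowed)}")
    return errors


print(validate_attributes({"severity": "mild"}, SCHEMA))
print(validate_attributes({"severity": "extreme"}, SCHEMA))
```

This only detects violations after generation. It does not constrain decoding the way a schema passed to the model would, so it sidesteps the over-constraining concern raised earlier in the thread at the cost of extra retries.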