
Add `config` and `model` parameters to `extract()`

Open aksg87 opened this issue 4 months ago • 4 comments


Hey @mariano, creating an issue about the feature you proposed - I think it's a great idea and was something I had in mind, so please feel free to send a PR if you're interested!

Following up from #99, the idea is to let `extract()` accept either a `ModelConfig` or a pre-instantiated model directly for explicit provider selection and model reuse.

The API would look like:

# Pass a config
config = factory.ModelConfig(
    model_id="gemini-2.5-flash",
    provider="GeminiLanguageModel",
    provider_kwargs={"api_key": "...", "temperature": 0.7}
)
result = extract(text_or_documents=content, config=config, ...)

# Or pass a model directly  
model = factory.create_model(config)
result = extract(text_or_documents=content, model=model, ...)

Precedence would be: model > config > model_id+kwargs > language_model_type (deprecated).
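To make the precedence concrete, here is a minimal sketch of how the resolution could work. Everything in it is an assumption for illustration: `ModelConfig` and `create_model` are simplified stand-ins for the factory pieces (the stand-in `create_model` just returns a dict), and `resolve_model` is a hypothetical helper name, not an existing function.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ModelConfig:
    """Simplified stand-in for factory.ModelConfig (assumption)."""
    model_id: Optional[str] = None
    provider: Optional[str] = None
    provider_kwargs: dict = field(default_factory=dict)


def create_model(config: ModelConfig) -> dict:
    """Stand-in for factory.create_model; returns a plain dict, not a real model."""
    return {
        "model_id": config.model_id,
        "provider": config.provider,
        "kwargs": dict(config.provider_kwargs),
    }


def resolve_model(model: Any = None,
                  config: Optional[ModelConfig] = None,
                  model_id: Optional[str] = None,
                  language_model_type: Any = None,
                  **kwargs: Any) -> Any:
    """Hypothetical helper: model > config > model_id+kwargs > language_model_type."""
    if model is not None:
        # A pre-instantiated model wins over everything else.
        return model
    if config is not None:
        return create_model(config)
    if model_id is not None:
        return create_model(ModelConfig(model_id=model_id, provider_kwargs=kwargs))
    if language_model_type is not None:
        warnings.warn(
            "language_model_type is deprecated; pass model or config instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumes the deprecated argument is a class or a name (assumption).
        name = getattr(language_model_type, "__name__", str(language_model_type))
        return create_model(ModelConfig(provider=name, provider_kwargs=kwargs))
    raise ValueError("Provide one of: model, config, model_id, language_model_type.")
```

A precedence test could then pass several arguments at once and assert that the highest-priority one is used, e.g. `resolve_model(model=m, config=cfg) is m`.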

We'd need tests for precedence, factory feature preservation, and backward compat. This would complement the factory infrastructure from #97 in a nice way.

Could also be interesting to support loading configs from JSON/YAML files down the road - would make it easy to manage different model configurations.

aksg87, Aug 09 '25 05:08

@aksg87 Perfect. I'll create the PR tomorrow

mariano, Aug 10 '25 22:08

Sounds great, thanks!

aksg87, Aug 11 '25 02:08

@aksg87 Regarding the configuration file approach: it sounds great. TOML feels better for future-proofing the configuration in a text-friendly way, but YAML and JSON are definitely more popular and less pythoniac 🤓. It opens the door to some fun use cases where the extraction settings come from LLM-generated TOML, specific to the job at hand.

mariano, Aug 11 '25 20:08

Hi @mariano, I agree, tomllib is a great option, especially since it's part of the standard library (Python 3.11+). You make a good point that while the current configs are simple, using a dedicated file format is a great way to plan for future complexity.

On a related note, I've seen a few issues where users can't pass specific parameters to providers like Ollama. A very useful contribution would be to implement a generalized fix that allows any extra parameters in the config to be passed through to the underlying model. Let me know if that's something you'd be interested in exploring!
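One possible shape for such a pass-through, as a sketch only: `SketchProvider` is a hypothetical class, not an existing provider, and the `options` payload layout, the model name, and the extra parameter names are illustrative assumptions.

```python
from typing import Any


class SketchProvider:
    """Hypothetical provider showing kwargs pass-through; not a real class."""

    def __init__(self, model_id: str, temperature: float = 0.0,
                 **extra_kwargs: Any) -> None:
        self.model_id = model_id
        self.temperature = temperature
        # Anything the provider does not name explicitly is kept as-is
        # instead of being dropped or rejected.
        self._extra_kwargs = extra_kwargs

    def build_request(self, prompt: str) -> dict:
        """Merge known and extra parameters into one request payload."""
        options = {"temperature": self.temperature}
        options.update(self._extra_kwargs)
        return {"model": self.model_id, "prompt": prompt, "options": options}


# Extra parameters here are illustrative only.
provider = SketchProvider("some-local-model", temperature=0.2,
                          num_ctx=4096, top_k=40)
request = provider.build_request("Extract the entities.")
```

The trade-off is that unknown or misspelled parameters are forwarded silently, so validation would have to happen in the underlying model or backend.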

aksg87, Aug 21 '25 05:08