Long `ExampleData` causes `extract` to hang on `_fuzzy_align_extraction`
Hi,
First of all, thanks for open sourcing this nice package. :-)
I am trying to use langextract to post-process interviews by tagging (sometimes long) quotes in a text.
The problem I run into is that the text extraction takes exceedingly long to complete.
For example, it takes 17 minutes (~1000 seconds) to analyze a text of 355 characters (59 words) using an examples list that contains a single ExampleData with:
- a text of length ~2,100 words (~13,000 characters)
- 10 Extractions with extraction_text sizes of:
  - 6 characters
  - 7 characters
  - 7 characters
  - 115 characters
  - 496 characters
  - 207 characters
  - 84 characters
  - 139 characters
  - 36 characters
  - 334 characters
(Unfortunately, I cannot share the actual contents of the text for privacy reasons.)
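For reference, the setup looks roughly like this. It is a minimal sketch assuming the standard `langextract` `ExampleData`/`Extraction` API; the texts, the `quote` class name, and the prompt are placeholders for my actual (non-shareable) data:

```python
import langextract as lx

# Placeholders: the real ~13,000-character transcript and the 355-character
# input cannot be shared, and "quote" is a stand-in extraction class name.
long_example_text = "... ~2,100 words of interview transcript ..."
interview_text = "... 355-character interview snippet ..."

# A single large example document carrying all 10 example extractions.
examples = [
    lx.data.ExampleData(
        text=long_example_text,
        extractions=[
            # 10 Extractions whose extraction_text ranges from 6 to ~500
            # characters, each copied verbatim from long_example_text.
            lx.data.Extraction(extraction_class="quote",
                               extraction_text="... one of the example quotes ..."),
            # ... 9 more ...
        ],
    )
]

result = lx.extract(
    text_or_documents=interview_text,
    prompt_description="Tag notable quotes in the interview.",
    examples=examples,
    model_id="gemini-2.5-flash",
)
```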
Most of the time is spent before any output is generated. When I terminate the program, it is always stuck at `_fuzzy_align_extraction`. Once it starts spitting out output like
WARNING:absl:Prompt alignment: non-exact match:
the program finishes quickly. Here is some corresponding output generated after the long silent period:
LangExtract: model=gemini-2.5-flash, current=358 chars, processed=358 chars: [00:09] ✓ Extraction processing complete
INFO:absl:Finalizing annotation for document ID .
INFO:absl:Document annotation completed.
✓ Extracted 3 entities (1 unique types) • Time: 9.76s • Speed: 37 chars/sec • Chunks: 1
Any suggestions on how to speed up the `extract` function?
Thanks in advance,
Hylke
One solution that worked well is to chop the ExampleData document up into several smaller ExampleData mini-documents, as sketched below.
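In case it helps others, the workaround looks roughly like this (again a sketch with placeholder texts and a hypothetical `quote` class; `example_passages` stands in for my real excerpt/quote pairs):

```python
import langextract as lx

# Placeholder (excerpt, quote) pairs standing in for the real interview data.
example_passages = [
    ("... short excerpt containing the first quote ...", "... first quote ..."),
    ("... short excerpt containing the second quote ...", "... second quote ..."),
    # etc.
]

# Instead of one ExampleData carrying the full ~13,000-character transcript,
# build several small ExampleData objects, each holding only the passage
# around one example quote.
mini_examples = [
    lx.data.ExampleData(
        text=passage,
        extractions=[
            lx.data.Extraction(extraction_class="quote", extraction_text=quote)
        ],
    )
    for passage, quote in example_passages
]

result = lx.extract(
    text_or_documents="... 355-character interview snippet ...",
    prompt_description="Tag notable quotes in the interview.",
    examples=mini_examples,
    model_id="gemini-2.5-flash",
)
```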
Hi @hcdonker-code - that makes sense, as fuzzy align is not the most efficient matching algorithm right now (it is more of a fallback), so making things smaller can help efficiency a lot. Thanks for reporting the issue and the temporary solution.
If you have more details on the before/after, maybe with some outputs from your log, that would be a useful reference for others. Eventually, a more efficient fuzzy align is also something that will help the library's fallback robustness.
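For anyone curious why shrinking the examples helps so much: approximate string matching gets expensive quickly as the text being searched grows. The following is not langextract's actual `_fuzzy_align_extraction` code, just a rough illustration of the kind of work a naive fuzzy aligner does, using `difflib` as an assumed stand-in:

```python
import difflib

def naive_fuzzy_find(needle: str, haystack: str, threshold: float = 0.8):
    """Slide a window of len(needle) over haystack and keep the best fuzzy match.

    This does on the order of len(haystack) SequenceMatcher comparisons, each
    costing roughly len(needle) work, so long texts combined with long
    extraction_text values (e.g. ~500 characters) add up fast.
    """
    best_ratio, best_start = 0.0, -1
    window = len(needle)
    for start in range(0, max(1, len(haystack) - window + 1)):
        candidate = haystack[start:start + window]
        ratio = difflib.SequenceMatcher(None, needle, candidate).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    return best_start if best_ratio >= threshold else None
```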