
Long `ExampleData` causes `extract` to hang on `_fuzzy_align_extraction`

Open hcdonker-code opened this issue 1 month ago • 2 comments

Hi,

First of all, thanks for open sourcing this nice package. :-)

I am trying to use langextract to post-process interviews by tagging (sometimes long) quotes in a text. The problem I run into is that text extraction takes exceedingly long to complete. For example, it takes 17 minutes (~1,000 seconds) to analyze a text of 355 characters (59 words) using an examples list that contains a single ExampleData with:

  • A text of ~2,100 words (~13,000 characters) and 10 Extractions with extraction_text sizes of:
    • 6 characters
    • 7 characters
    • 7 characters
    • 115 characters
    • 496 characters
    • 207 characters
    • 84 characters
    • 139 characters
    • 36 characters
    • 334 characters

(Unfortunately, I cannot share the actual contents of the text for privacy reasons.) Most of the time is spent before any output is generated. When I terminate the program, it is always stuck at _fuzzy_align_extraction. Once it starts emitting output like this:

WARNING:absl:Prompt alignment: non-exact match:

the program finishes quickly. Here is some of the corresponding output generated after the long silent period:

LangExtract: model=gemini-2.5-flash, current=358 chars, processed=358 chars: [00:09] ✓ Extraction processing complete
INFO:absl:Finalizing annotation for document ID .
INFO:absl:Document annotation completed.
✓ Extracted 3 entities (1 unique types) • Time: 9.76s • Speed: 37 chars/sec • Chunks: 1

Any suggestions on how to speed up the extract function?

Thanks in advance,

Hylke

hcdonker-code avatar Nov 04 '25 16:11 hcdonker-code

One solution that worked well is to chop the long ExampleData document up into several ExampleData mini documents.
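For reference, a minimal sketch of that splitting idea. Note that `ExampleData` and `Extraction` below are simplified stand-in dataclasses, not langextract's actual classes; the real ones live in langextract's data module and carry more fields. The sketch splits one long example by paragraph and attaches each extraction to the first paragraph whose text contains its `extraction_text`:

```python
from dataclasses import dataclass, field

# Simplified stand-ins for illustration only; use langextract's own
# ExampleData / Extraction classes in real code.
@dataclass
class Extraction:
    extraction_class: str
    extraction_text: str

@dataclass
class ExampleData:
    text: str
    extractions: list = field(default_factory=list)

def split_example(example, sep="\n\n"):
    """Split one long example into mini examples, one per paragraph.

    Each extraction is attached to the first paragraph containing its
    extraction_text; paragraphs without any extraction are dropped.
    """
    minis = []
    remaining = list(example.extractions)
    for para in example.text.split(sep):
        hits = [e for e in remaining if e.extraction_text in para]
        if hits:
            minis.append(ExampleData(text=para, extractions=hits))
            remaining = [e for e in remaining if e not in hits]
    return minis

big = ExampleData(
    text="Alice said hello.\n\nBob waved back.\n\nNothing notable here.",
    extractions=[
        Extraction("quote", "hello"),
        Extraction("gesture", "waved back"),
    ],
)
for mini in split_example(big):
    print(len(mini.text), [e.extraction_text for e in mini.extractions])
```

This assumes each extraction_text falls entirely within one paragraph; extractions spanning a paragraph break would need a larger split unit.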

hcdonker-code avatar Nov 05 '25 08:11 hcdonker-code

Hi @hcdonker-code, that makes sense: fuzzy align is not the most efficient matching algorithm right now (it is more of a fallback), so making the examples smaller can help efficiency a lot. Thanks for reporting the issue and the temporary solution.

If you have more details on the before/after, perhaps with some outputs from your log, that would be a useful reference for others. Eventually, a more efficient fuzzy align will also improve the library's fallback robustness.

aksg87 avatar Nov 06 '25 16:11 aksg87