doctr Line resolution isn't working on table-like structures

🚀 The feature

I need to extract words line by line. When I iterate over blocks and then over lines, some words come out of sequence.

Motivation, pitch

I need the words in their proper reading order.

Alternatives

No response

Additional context

No response

Mar 07 '22 08:03 devendrapal5755

Hi @devendrapal5755 :wave:

I'm sorry, I'm not sure I understand your problem here. Do you mean that the words are not ordered correctly in lines & blocks in your predictions?

Either way, could you share a minimal snippet to reproduce this behaviour (and the document/page you performed the inference on) please? This would allow us to reproduce this and investigate :pray:

Mar 07 '22 12:03 fg-mindee

Any update @devendrapal5755 ? :)

Mar 10 '22 10:03 fg-mindee

You can see that the model puts "The lease deed" and "Rs-5,50,000/-" in the same block as "premium stamp duty paid now". What I actually want is the text laid out the same way as in the image.

Mar 10 '22 11:03 devendrapal5755

I'm currently trying to work around something similar. What would be the best way of extracting text similar to Tesseract's 'preserve-interword-spaces'?

Jul 13 '22 11:07 NGStaph

Hi @devendrapal5755 @NGStaph :wave:

Sorry about the late reply! Would you mind sharing your input doc/image and elaborating on the expected vs actual result please? :pray:

Jul 20 '22 09:07 frgfm

On second thought, I might be getting an error due to warping in '00093726.png' of the FUNSD dataset. The order of words that docTR outputs on a deskewed version of the image is unexpected/incorrect. Maybe this is related to #537?

Actual Order: 'Date: Soplembe r 21, 1/6', 'Sample No. 6030', 'Type of Cigarette', '85 mm Filter', '50 lbs.', 'Batch Size', 'Dr. A. W. Spears', 'Original Request Made By.', 'Sample Specifications Written By.', 'on September 21, 1976', 'W. E. Routh'.....

Jul 21 '22 09:07 NGStaph

Hi @NGStaph,

I'm sorry but I don't think I get the difference between what you want and what you currently have :sweat_smile: So I'll share relevant information hoping that will help you:

- As you can see in https://github.com/mindee/doctr/blob/main/doctr/datasets/funsd.py#L95, only unknown characters are removed from the loaded data of FUNSD.
- The convention selected by the dataset creator/owner is up to them; we have tried to keep from altering that.
- In docTR, our predictors and document builders consider a word to be an uninterrupted sequence of characters (no white space).

Feel free to give more specifics about your problem if that doesn't help :) Also, if the topic is starting to diverge from the original issue description, please consider opening another dedicated issue :pray:

Jul 26 '22 11:07 frgfm

@frgfm , @charlesmindee , is line resolution provided? We have an example from README and the image from https://github.com/mindee/doctr/pull/537#issuecomment-950655246.

These worlds are totally different!

And I don't get a line via README's code. WDIDW?

Thanks!

Jan 25 '23 17:01 kuraga

Hi @kuraga :wave:

Circling back to this, here are some answers:

- The README snippet does not use line resolution (feel free to enable it).
- On the other picture, as you pointed out, the spatial density of lines is quite different. The only robust method I can see is to use both spatial and semantic information (we verify our spatial understanding by checking whether the 2-part sentence makes sense). Since line resolution in docTR has been using heuristics so far, it's indeed difficult to accommodate all distributions :sweat_smile:

If you have a very specific spatial distribution, I suggest not relying on the built-in line resolution and grouping the words into lines on your own :ok_hand:
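For illustration, here is a minimal sketch of what resolving lines yourself could look like. It is not docTR's own heuristic: it assumes you have already flattened the predicted words into `(text, xmin, ymin, xmax, ymax)` tuples in relative page coordinates, and the function name `group_words_into_lines`, the `min_overlap` threshold, and the sample coordinates are all made up for the example.

```python
from typing import List, Tuple

# Hypothetical flat word format: (text, xmin, ymin, xmax, ymax), relative coords.
Word = Tuple[str, float, float, float, float]


def group_words_into_lines(words: List[Word], min_overlap: float = 0.5) -> List[List[str]]:
    """Group words into lines by vertical overlap, then order each line left to right."""
    lines = []  # each entry is [line_ymin, line_ymax, words_on_that_line]
    # Walk the words from top to bottom, using the vertical center of each box.
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        _, _, ymin, _, ymax = word
        placed = False
        if lines:
            line = lines[-1]
            overlap = min(ymax, line[1]) - max(ymin, line[0])
            shortest = min(ymax - ymin, line[1] - line[0])
            # Same line if the boxes share enough of the shorter box's height.
            if shortest > 0 and overlap / shortest >= min_overlap:
                line[0] = min(line[0], ymin)
                line[1] = max(line[1], ymax)
                line[2].append(word)
                placed = True
        if not placed:
            lines.append([ymin, ymax, [word]])
    # Within each line, read the words from left to right.
    return [[w[0] for w in sorted(line[2], key=lambda w: w[1])] for line in lines]


# Made-up two-column, table-like layout (coordinates are illustrative only).
sample_words = [
    ("Rs-5,50,000/-", 0.60, 0.10, 0.80, 0.13),
    ("Premium", 0.05, 0.10, 0.20, 0.13),
    ("Stamp", 0.05, 0.20, 0.15, 0.23),
    ("duty", 0.16, 0.205, 0.24, 0.235),
    ("paid", 0.60, 0.20, 0.70, 0.23),
]

for line in group_words_into_lines(sample_words):
    print(" ".join(line))
```

Because the grouping only looks at vertical overlap, words from distant columns that sit at the same height end up on the same line, which is usually what you want for table rows. Skewed or warped pages would need deskewing first, or a more tolerant threshold.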

Cheers!

Apr 29 '23 09:04 frgfm

@frgfm , thanks!

> the README snippet does not use line resolution (feel free to enable it)

How do I enable it?

Apr 30 '23 14:04 kuraga

Hey @kuraga,

This is a constructor arg of the DocumentBuilder (https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L32-L33) that you can pass as kwargs to the ocr_predictor https://mindee.github.io/doctr/latest/modules/models.html#doctr.models.ocr_predictor
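As a rough sketch of the above (assuming the builder arguments are named `resolve_lines` and `resolve_blocks` in your installed version; the helper name `build_predictor` and the image path are placeholders), passing them through `ocr_predictor` could look like this:

```python
def build_predictor(resolve_lines: bool = True, resolve_blocks: bool = False):
    """Create an OCR predictor with explicit DocumentBuilder options."""
    # Imported lazily so this helper can be defined without docTR installed.
    from doctr.models import ocr_predictor

    # Extra keyword arguments are forwarded to the DocumentBuilder.
    return ocr_predictor(
        pretrained=True,
        resolve_lines=resolve_lines,
        resolve_blocks=resolve_blocks,
    )


if __name__ == "__main__":
    from doctr.io import DocumentFile

    predictor = build_predictor(resolve_lines=True)
    doc = DocumentFile.from_images("path/to/page.png")  # placeholder path
    result = predictor(doc)

    # Print the recognized text line by line.
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                print(" ".join(word.value for word in line.words))
```

Setting the flags explicitly avoids depending on the defaults, which differ between versions of the library.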

I hope this helps!

May 28 '23 11:05 frgfm

@frgfm , thanks!

P.S.

> the README snippet does not use line resolution (feel free to enable it)

> This is a constructor arg of the DocumentBuilder (https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L32-L33) that you can pass as kwargs to the ocr_predictor https://mindee.github.io/doctr/latest/modules/models.html#doctr.models.ocr_predictor

https://github.com/mindee/doctr/blob/1bf12a3ec73ddb463420d4243133b2d423a602d3/doctr/models/builder.py#L32-L33

Hm, it's enabled by default...

May 28 '23 12:05 kuraga

@kuraga it depends on your version of the library, but yes, it's currently enabled by default on the dev version ;) I suggest passing it as a kwarg if you aren't using the latest dev version!

May 28 '23 13:05 frgfm

Closing this because it looks like it's solved :) Otherwise, feel free to reply on this thread to have it re-opened.

Sep 17 '23 12:09 felixdittrich92
