
Line resolution isn't working on table-like structures

Open devendrapal5755 opened this issue 3 years ago • 7 comments

🚀 The feature

I need to extract words line by line. When I iterate over blocks and then over lines, some words do not come out in the correct sequence.

Motivation, pitch

I need the words in their proper reading order.

Alternatives

No response

Additional context

No response

devendrapal5755 avatar Mar 07 '22 08:03 devendrapal5755

Hi @devendrapal5755 :wave:

I'm sorry, I'm not sure I understand your problem here. Do you mean that the words are not ordered correctly in lines & blocks in your predictions?

Either way, could you share a minimal snippet to reproduce this behaviour (and the document/page you performed the inference on) please? This would allow us to reproduce this and investigate :pray:

fg-mindee avatar Mar 07 '22 12:03 fg-mindee

Any update @devendrapal5755 ? :)

fg-mindee avatar Mar 10 '22 10:03 fg-mindee

As you can see, the model puts "The lees deed" and "Rs-5,50,000/-" in the same block as "premium stamp duty paid now". What I actually want is the same layout as in the image.

devendrapal5755 avatar Mar 10 '22 11:03 devendrapal5755

I'm currently trying to work around something similar. What would be the best way of extracting text similar to Tesseract's 'preserve-interword-spaces'?

NGStaph avatar Jul 13 '22 11:07 NGStaph

Hi @devendrapal5755 @NGStaph :wave:

Sorry about the late reply! Would you mind sharing your input doc/image and elaborating on the expected vs actual result please? :pray:

frgfm avatar Jul 20 '22 09:07 frgfm

On second thought, I might be getting an error due to warp within '00093726.png' of the FUNSD dataset. The order of words that docTR outputs on a deskewed version of the image is unexpected/incorrect. Maybe related to #537?

Actual Order: 'Date: Soplembe r 21, 1/6', 'Sample No. 6030', 'Type of Cigarette', '85 mm Filter', '50 lbs.', 'Batch Size', 'Dr. A. W. Spears', 'Original Request Made By.', 'Sample Specifications Written By.', 'on September 21, 1976', 'W. E. Routh'.....

NGStaph avatar Jul 21 '22 09:07 NGStaph

Hi @NGStaph,

I'm sorry but I don't think I get the difference between what you want and what you currently have :sweat_smile: So I'll share relevant information hoping that will help you:

  • as you can see in https://github.com/mindee/doctr/blob/main/doctr/datasets/funsd.py#L95, only unknown characters are removed from the loaded data of FUNSD
  • the convention selected by the dataset creator/owner is up to them, we have tried to keep from altering that
  • in docTR, our predictors and document builders consider a word to be an uninterrupted sequence of characters (no white space)

Feel free to give more specifics about your problem if that doesn't help :) Also, if the topic is starting to differ from the original issue description, please consider opening another dedicated issue :pray:

frgfm avatar Jul 26 '22 11:07 frgfm

@frgfm , @charlesmindee , is line resolution provided? We have an example from README and the image from https://github.com/mindee/doctr/pull/537#issuecomment-950655246.

image image

These worlds are totally different!

And I don't get lines via the README's code. What did I do wrong?

Thanks!

kuraga avatar Jan 25 '23 17:01 kuraga

Hi @kuraga :wave:

Circling back to this, here are some answers:

  • the README snippet does not use line resolution (feel free to enable it)
  • on the other picture, as you pointed out, the spatial density of lines is quite different. The only robust method I can see is to use both spatial and semantic information (i.e. verifying our spatial understanding by checking whether the 2-part sentence makes sense). Since line resolution in docTR has relied on heuristics so far, it's indeed difficult to accommodate all distributions :sweat_smile:

If you have a very specific spatial distribution, I suggest not relying on docTR's line resolution and grouping the words into lines on your own :ok_hand:
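For anyone going the do-it-yourself route, here is a minimal sketch of one possible grouping heuristic. It is not docTR's internal algorithm: the `Word` class, the `group_into_lines` helper and the `overlap` threshold are all hypothetical names introduced for illustration. It assumes each word comes with a straight bounding box (xmin, ymin, xmax, ymax) in relative page coordinates, and it merges words whose boxes overlap vertically with the line built so far.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical container: text plus a straight box in relative coordinates
    text: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def y_center(self) -> float:
        return (self.ymin + self.ymax) / 2


def group_into_lines(words: List[Word], overlap: float = 0.5) -> List[List[Word]]:
    """Group words into lines by vertical overlap, then order each line left to right.

    `overlap` is the fraction of a word's height that must fall inside the
    vertical band of the current line for the word to join it (an assumption
    to tune per document type).
    """
    lines: List[List[Word]] = []
    # Visit words from top to bottom so that each line is built contiguously
    for word in sorted(words, key=lambda w: w.y_center):
        if lines:
            current = lines[-1]
            top = min(w.ymin for w in current)
            bottom = max(w.ymax for w in current)
            # Height of the intersection between the word and the line band
            shared = min(bottom, word.ymax) - max(top, word.ymin)
            if shared >= overlap * (word.ymax - word.ymin):
                current.append(word)
                continue
        lines.append([word])
    # Reading order inside a line: left to right
    return [sorted(line, key=lambda w: w.xmin) for line in lines]


def lines_to_text(lines: List[List[Word]]) -> List[str]:
    """Join the words of each line with single spaces."""
    return [" ".join(w.text for w in line) for line in lines]
```

Because the band of a line grows as words are added, this works best on deskewed pages. On table-like layouts you may also want to split a line wherever the horizontal gap between neighbouring words is large.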

Cheers!

frgfm avatar Apr 29 '23 09:04 frgfm

@frgfm , thanks!

the README snippet does not use line resolution (feel free to enable it)

How do I enable it?

kuraga avatar Apr 30 '23 14:04 kuraga

Hey @kuraga,

This is a constructor arg of the DocumentBuilder (https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L32-L33) that you can pass as kwargs to the ocr_predictor https://mindee.github.io/doctr/latest/modules/models.html#doctr.models.ocr_predictor
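In practice this could look like the sketch below. It assumes the argument is named `resolve_lines` (with a sibling `resolve_blocks`), as in the builder file linked above, and that `result.export()` returns a nested pages/blocks/lines/words dictionary; please check both against your installed version. `build_predictor` and `lines_as_text` are hypothetical helper names, not part of docTR.

```python
from typing import Any, Dict, List


def build_predictor(resolve_lines: bool = True, resolve_blocks: bool = False) -> Any:
    """Create an OCR predictor, forwarding the DocumentBuilder options as kwargs."""
    # Imported lazily so this module can be loaded without docTR installed
    from doctr.models import ocr_predictor

    # Assumption: extra kwargs are forwarded to the DocumentBuilder constructor
    return ocr_predictor(
        pretrained=True,
        resolve_lines=resolve_lines,
        resolve_blocks=resolve_blocks,
    )


def lines_as_text(export: Dict[str, Any]) -> List[str]:
    """Flatten an exported result into one string per resolved line."""
    return [
        " ".join(word["value"] for word in line["words"])
        for page in export["pages"]
        for block in page["blocks"]
        for line in block["lines"]
    ]


# Typical use (requires docTR and downloads pretrained weights):
#   from doctr.io import DocumentFile
#   doc = DocumentFile.from_images("page.png")  # hypothetical path
#   result = build_predictor()(doc)
#   print(lines_as_text(result.export()))
```

Passing the flags explicitly keeps the behaviour the same regardless of which defaults your version ships with.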

I hope this helps!

frgfm avatar May 28 '23 11:05 frgfm

@frgfm , thanks!

P.S.

the README snippet does not use line resolution (feel free to enable it)

This is a constructor arg of the DocumentBuilder (https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L32-L33) that you can pass as kwargs to the ocr_predictor https://mindee.github.io/doctr/latest/modules/models.html#doctr.models.ocr_predictor

https://github.com/mindee/doctr/blob/1bf12a3ec73ddb463420d4243133b2d423a602d3/doctr/models/builder.py#L32-L33

Hm, it's enabled by default...

kuraga avatar May 28 '23 12:05 kuraga

@kuraga it depends on your version of the library, but yes it's enabled by default on the dev version currently ;) I suggest passing that as a kwarg if you aren't using the latest dev version !

frgfm avatar May 28 '23 13:05 frgfm

Closing this because it looks like it's solved :) Otherwise, feel free to reply on this thread to have it re-opened.

felixdittrich92 avatar Sep 17 '23 12:09 felixdittrich92