doctr
doctr copied to clipboard
Line resolution isn't working on table-like structures
🚀 The feature
I have extract word by line , because when i am using block and then line part faced issues some word will not sequence wise.
Motivation, pitch
word in proper sequence order require
Alternatives
No response
Additional context
No response
Hi @devendrapal5755 :wave:
I'm sorry, I'm not sure I understand your problem here. Do you mean that the words are not ordered correctly in lines & blocks in your predictions?
Either way, could you share a minimal snippet to reproduce this behaviour (and the document/page you performed the inference on) please? This would allow us to reproduce this and investigate :pray:
Any update @devendrapal5755 ? :)
you can see model consider block - The lees deed and Rs-5,50,000/- same as premium stamp duty paid now . Actually i want same as it in image .
I'm currently trying to work around something similar. What would be the best way of extracting text similar to Tesseract's 'preserve-interword-spaces'?
Hi @devendrapal5755 @NGStaph :wave:
Sorry about the late reply! Would you mind sharing your input doc/image and elaborating on the expected vs actual result please? :pray:
on second thought.. i might be getting an error due to warp within '00093726.png' of the FUNSD dataset. the order of words that docTR outputs on a deskewed version of the image is unexpected/incorrect maybe related to #537 ?
Actual Order: 'Date: Soplembe r 21, 1/6', 'Sample No. 6030', 'Type of Cigarette', '85 mm Filter', '50 lbs.', 'Batch Size', 'Dr. A. W. Spears', 'Original Request Made By.', 'Sample Specifications Written By.', 'on September 21, 1976', 'W. E. Routh'.....
Hi @NGStaph,
I'm sorry but I don't think I get the difference between what you want and what you currently have :sweat_smile: So I'll share relevant information hoping that will help you:
- as you can see in https://github.com/mindee/doctr/blob/main/doctr/datasets/funsd.py#L95, only unknown characters are removed from the loaded data of FUNSD
- the convention selected by the dataset creator/owner is up to them, we have tried to keep from altering that
- in docTR, our predictors and document builders consider word as being uninterrupted sequences of characters (no white space)
Feel free to give more specifics about your problem if that doesn't help :) Also, if the topic is starting to different from the orginal issue description, please consider opening another dedicated issue :pray:
@frgfm , @charlesmindee , is line resolution provided? We have an example from README and the image from https://github.com/mindee/doctr/pull/537#issuecomment-950655246.

These worlds are totally different!
And I don't get a line via README's code. WDIDW?
Thanks!
Hi @kuraga :wave:
Circling back to this, here are some answers:
- the README snippet does not use line resolution (feel free to enable it)
- on the other picture, as you pointed it out, the spatial density of lines is quite different. The only robust method I can see is to use both spatial and semantic information (we verify our spatial understanding by checking whether the 2-part sentence makes sense). Since line resolution in docTR has been using heuristics so far, it's difficult to accommodate all distributions indeed :sweat_smile:
I suggest not trying to resolve lines and doing it on your own if you have a very specific spatial distribution :ok_hand:
Cheers!
@frgfm , thanks!
the README snippet does not use line resolution (feel free to enable it)
How do I enable it?
Hey @kuraga,
This is a constructor arg of the DocumentBuilder (https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L32-L33) that you can pass as kwargs to the ocr_predictor https://mindee.github.io/doctr/latest/modules/models.html#doctr.models.ocr_predictor
I hope this helps!
@frgfm , thanks!
P.S.
the README snippet does not use line resolution (feel free to enable it)
This is a constructor arg of the DocumentBuilder (https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L32-L33) that you can pass as kwargs to the ocr_predictor https://mindee.github.io/doctr/latest/modules/models.html#doctr.models.ocr_predictor
https://github.com/mindee/doctr/blob/1bf12a3ec73ddb463420d4243133b2d423a602d3/doctr/models/builder.py#L32-L33
Hm, it's enabled by default...
@kuraga it depends on your version of the library, but yes it's enabled by default on the dev version currently ;) I suggest passing that as a kwarg if you aren't using the latest dev version !
Closing this because looks like it's solved :) Otherwise feel free to answer on this thread for re-opening