amazon-textract-textractor: getting text from detect_document_text

Open bvbg1 opened this issue 2 years ago • 3 comments

How can I get the text in natural reading order (left to right), with line-break information preserved, when using detect_document_text?

Example image: test

document.text output:

quick a brown fox
jumps over the lazy dog
word3
word4
word5 word7
word1 word2
word8word9 word10

document.lines output: [quick a brown fox, jumps over the lazy dog, word3, word4, word5 word7, word1 word2, word8word9 word10]

document.words output: [a, brown, the, fox, over, jumps, dog, quick, lazy, word7, word3, word4, word2, word5, word1, word8word9, word10]
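
To illustrate the ordering I'm after, here is a minimal sketch that groups words into visual rows by vertical position and then sorts each row left to right. It is not textractor's API: the `Word` class, its fields, the `reading_order_lines` helper, the `tolerance` parameter, and the coordinates are all hypothetical stand-ins for whatever bounding-box data the response exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for an OCR word with a normalized bounding box."""
    text: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    height: float  # height of the box


def reading_order_lines(words: List[Word], tolerance: float = 0.5) -> List[str]:
    """Group words into visual rows, then sort each row left to right."""
    rows: List[List[Word]] = []
    # Visit words from top to bottom so rows are created in vertical order.
    for word in sorted(words, key=lambda w: w.y + w.height / 2):
        center = word.y + word.height / 2
        if rows:
            last = rows[-1]
            row_center = sum(w.y + w.height / 2 for w in last) / len(last)
            row_height = sum(w.height for w in last) / len(last)
            # Same row if the vertical centers are within a fraction of the row height.
            if abs(center - row_center) <= tolerance * row_height:
                last.append(word)
                continue
        rows.append([word])
    # One string per row: the row boundaries carry the line-break information.
    return [" ".join(w.text for w in sorted(row, key=lambda w: w.x)) for row in rows]


# Made-up coordinates, deliberately shuffled, with slightly uneven baselines.
sample = [
    Word("word2", 0.30, 0.500, 0.04),
    Word("word4", 0.70, 0.510, 0.04),
    Word("word1", 0.10, 0.505, 0.04),
    Word("word3", 0.50, 0.495, 0.04),
    Word("quick", 0.10, 0.100, 0.04),
    Word("fox", 0.40, 0.100, 0.04),
    Word("brown", 0.25, 0.105, 0.04),
]
lines = reading_order_lines(sample)
print("\n".join(lines))
```

With this sample, words that sit side by side on the page end up on the same output line even when their baselines differ slightly, which is the behaviour I would like from document.text. A fixed tolerance like this is only a guess and would likely break on skewed scans or multi-column layouts.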

Feb 21 '23 03:02 bvbg1