
Support conversion from and to Textract JSON

Open scottschreckengaust opened this issue 5 years ago • 4 comments

Amazon Textract returns its results in a JSON output format.

https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf

Specifically, there are three types of analysis (see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html), one for each of the categories:

  1. text,
  2. forms, and
  3. tables
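To illustrate how the three categories show up in a Textract response, here is a minimal Python sketch. It assumes the documented top-level `Blocks` list and the `BlockType` values `PAGE`, `LINE`, `WORD`, `KEY_VALUE_SET`, `TABLE` and `CELL`. The mapping of block types to categories, the function name `count_categories` and the hand-written `sample` are illustrative assumptions, not part of Textract or of ocr-fileformat.

```python
from collections import Counter

# Assumed, simplified mapping from Textract BlockType values to the
# three categories named above. Anything else is counted as "other".
CATEGORY_BY_BLOCK_TYPE = {
    "PAGE": "text",
    "LINE": "text",
    "WORD": "text",
    "KEY_VALUE_SET": "forms",
    "TABLE": "tables",
    "CELL": "tables",
}


def count_categories(response: dict) -> Counter:
    """Count the blocks of a Textract-style response per category."""
    counts = Counter()
    for block in response.get("Blocks", []):
        category = CATEGORY_BY_BLOCK_TYPE.get(block.get("BlockType"), "other")
        counts[category] += 1
    return counts


# Hand-written sample in the shape of a Textract response (not real output).
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Name: Jane"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Name:"},
        {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"]},
        {"BlockType": "TABLE", "Id": "t1"},
        {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1},
    ]
}

if __name__ == "__main__":
    for category, count in sorted(count_categories(sample).items()):
        print(category, count)
```

A converter would need to handle each of these groups separately, since the form and table blocks reference the text blocks through relationships rather than containing the text themselves.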

scottschreckengaust, Jan 30 '20 23:01

Conversion from Textract to PAGE XML has now been added with pull request #160.

stweil, May 05 '23 13:05

Alas, the new converter is still incomplete, so

  • forms, and
  • tables

do not work yet. See https://github.com/slub/textract2page/issues/2

bertsky, Jun 06 '23 14:06

Update: tables work now, but the converter submodule needs to be updated here

bertsky, Aug 16 '23 15:08

> Update: tables work now, but the converter submodule needs to be updated here

I've updated the vendor submodules, including textract2page, in https://github.com/UB-Mannheim/ocr-fileformat/pull/166. The tables branch is not yet merged to master, though, and I think some files needed to run the tests properly are missing.

kba, Sep 06 '23 15:09