ocr-fileformat

Google Cloud Vision to PAGE-XML

Open kba opened this issue 5 years ago • 8 comments

It was mentioned before but @cneud just reminded me of https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page . Should not be too hard to integrate and would allow using GCV results in OCR-D/Transkribus/OCR4all.

BTW: Does anyone have experience with the Azure Computer Vision API in the context of OCR? As a sign of goodwill in times of Covid-19, they are currently offering a generous free tier that includes access to the vision API. It would be interesting to compare.

kba, Apr 29 '20 12:04

BTW the existing integration of GCV as part of the PRImA converter (transform gcv page, which links to the alto page script) is broken: it delegates to java -jar PageConverter.jar -source-xml $INFILE instead of java -jar PageConverter.jar -source-json $INFILE:

https://github.com/UB-Mannheim/ocr-fileformat/blob/8878b8aaed919f500e7ad0d33e881c9d872c4fb6/script/transform/alto__page#L19
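
To make the flag mix-up concrete, here is a minimal Python sketch of the selection logic (the real transform is a shell script; this is not its code). The function name build_converter_command and the from_format parameter are hypothetical; only the command-line flags are taken from this thread.

```python
import shlex


def build_converter_command(from_format, infile, outfile, jar="PageConverter.jar"):
    """Build the PageConverter argv, picking the source flag by input format.

    Hypothetical helper for illustration: GCV results are JSON, so they need
    -source-json; every other input handled by this transform is XML.
    """
    source_flag = "-source-json" if from_format == "gcv" else "-source-xml"
    return [
        "java", "-jar", jar,
        source_flag, infile,
        "-target-xml", outfile,
        "-convert-to", "LATEST",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Java.
    print(shlex.join(build_converter_command("gcv", "input.json", "output.xml")))
    print(shlex.join(build_converter_command("alto", "input.xml", "output.xml")))
```

Keying on the declared input format rather than on the file extension matters here, because GCV output is not always stored with a .json suffix.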

bertsky, Nov 17 '22 15:11

Thanks. So it was broken right from the beginning (commit 73328691c466057566db62d8cdbea8b26823bdbb).

stweil, Nov 17 '22 16:11

> So it was broken right from the beginning (commit 7332869).

I'm not sure. Perhaps the PRImA converter was capable of detecting the format automatically before. But it does not look like it.

Anyway, here is a fix: https://github.com/UB-Mannheim/ocr-fileformat/pull/156

bertsky, Nov 17 '22 16:11

I tried it with fixed arguments, and it fails:

java -jar vendor/JPageConverter/PageConverter.jar -neg-coords toZero -source-json 1850-Baptis-EMU-0204.txt -target-xml 1850-Baptis-EMU-0204.xml -convert-to LATEST
null
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.primaresearch.dla.page.Page.getLayout()" because "page" is null
	at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:449)
	at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:266)
	at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)

stweil, Nov 17 '22 16:11

> I tried it with fixed arguments, and it fails:

I know. That's because in this example, the input data is incomplete. See here

bertsky, Nov 17 '22 16:11

Since #156 we do have a working GCV converter here based on https://github.com/PRImA-Research-Lab/prima-page-converter, so there is no actual need for https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page.

Comparing both implementations, IIUC we have:

| implementation | cloud-vision-ocr-to-page | prima-page-converter with JSON input |
| --- | --- | --- |
| external dependencies | GCV (Java API) | none (standalone) |
| usage | online (network API) | offline (JSON) |
| can also output ALTO | no | yes |
| yields @imageFilename | yes | no |
| yields width and height | yes | yes |
| coordinates | bbox | bbox |
| paragraphs | recursive TextRegion | recursive TextRegion |
| other region types | Image+Separator+Graphic+Table | Image+Separator+Graphic+Table |
| aggregate words to lines | yes | yes |
| confidence | yes | no |

bertsky, Jun 06 '23 22:06

Thanks for the comparison, very helpful.

> | implementation | cloud-vision-ocr-to-page | prima-page-converter with JSON input |
> | --- | --- | --- |
> | external dependencies | GCV (Java API) | none (standalone) |
> | usage | online (network API) | offline (JSON) |

IMHO these are the strongest reasons against the cloud-vision-ocr-to-page approach.

It's unfortunate, though, that the confidences aren't serialized, the way gcv2hocr does with x_wconf for hOCR. But with development largely stalled, there is not much we can do except rewrite it ourselves.
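
For reference, a minimal Python sketch of what pulling word confidences out of a GCV result could look like. It assumes the usual fullTextAnnotation nesting (pages, blocks, paragraphs, words, symbols) with a per-word confidence between 0 and 1. The function name word_confidences is hypothetical, and scaling to an integer 0-100 value for x_wconf is an assumption, not a description of what gcv2hocr actually does.

```python
def word_confidences(full_text_annotation):
    """Yield (word_text, x_wconf) pairs from a GCV fullTextAnnotation dict.

    Assumption: x_wconf is written as an integer percentage, so the GCV
    confidence (0..1) is scaled by 100. Words without a confidence are skipped.
    """
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # GCV stores text per symbol; join the symbols to get the word.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    confidence = word.get("confidence")
                    if confidence is not None:
                        yield text, round(confidence * 100)


# Tiny hand-made sample, not real GCV output.
sample = {
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    {"confidence": 0.98, "symbols": [{"text": "H"}, {"text": "i"}]},
                    {"confidence": 0.5, "symbols": [{"text": "t"}, {"text": "o"}]},
                    {"symbols": [{"text": "x"}]},  # no confidence: skipped
                ]
            }]
        }]
    }]
}

if __name__ == "__main__":
    for text, wconf in word_confidences(sample):
        print(f"{text}\tx_wconf {wconf}")
```

The same traversal would be the starting point for writing confidences into PAGE-XML instead of hOCR; how the converter would serialize them is left open here.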

kba, Jun 09 '23 15:06

> It's unfortunate, though, that the confidences aren't serialized, the way gcv2hocr does with x_wconf for hOCR. But with development largely stalled, there is not much we can do except rewrite it ourselves.

We can (fix it ourselves and) ship our own builds. I have successfully set up Eclipse and can compile most of the modules (e.g. libs, PageViewer, PageConverter).

(I have done that with PageViewer including validator error messages.)

bertsky, Jun 09 '23 16:06