
gcv2ocr.py does not convert json

Open sarepal opened this issue 4 years ago • 6 comments

I'm working with the attached JSON file from GCV, but when I run gcv2ocr.py, the hOCR output contains only metadata and no content. osh-sample-1911a-0001.json.zip

sarepal avatar May 27 '20 15:05 sarepal

Thank you for your report. Did you use gcvocr.sh to get the JSON file?

dinosauria123 avatar May 30 '20 14:05 dinosauria123

No, I used a script based on a Google Cloud Vision tutorial. I'll look into using the shell script instead.

sarepal avatar Jun 02 '20 18:06 sarepal

@sarepal @dinosauria123 Is there any update on how to convert the JSON file attached above to hOCR? Thanks in advance.

svamsip avatar Jul 08 '20 11:07 svamsip

Update: I got the correct API key, generated the JSON using gcvocr.sh, and was able to convert it to hOCR with gcv2ocr.py.

However, I noticed that in the hOCR output each word is wrapped in its own <span class='ocr_line'....> instead of each line of text.

@dinosauria123 does gcv2ocr.py only process the data in the JSON's "textAnnotations" and not the data in "fullTextAnnotation"? Thanks.

sarepal avatar Nov 18 '20 21:11 sarepal
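
For readers comparing the two fields: in a Cloud Vision response, "textAnnotations" is a flat list (the whole text first, then one entry per word), while "fullTextAnnotation" is nested as pages, blocks, paragraphs, words, and symbols, with line endings recorded on symbols as a detectedBreak type. The following is only a rough sketch of how lines could be rebuilt from that hierarchy; it is not taken from gcv2ocr.py or gcv2hocr2.py, the function name is hypothetical, and treating HYPHEN as a line ending is an assumption.

```python
# Break types treated here as ending a line (assumption: HYPHEN is a
# line-wrapping hyphen, so it also closes the current line).
LINE_ENDINGS = {"EOL_SURE_SPACE", "LINE_BREAK", "HYPHEN"}


def lines_from_full_text(annotation):
    """Rebuild text lines from a fullTextAnnotation dict (hypothetical helper)."""
    lines, current = [], ""
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    for symbol in word.get("symbols", []):
                        current += symbol.get("text", "")
                        brk = (
                            symbol.get("property", {})
                            .get("detectedBreak", {})
                            .get("type")
                        )
                        if brk in ("SPACE", "SURE_SPACE"):
                            current += " "  # word boundary inside a line
                        elif brk in LINE_ENDINGS:
                            lines.append(current)  # line is complete
                            current = ""
    if current:
        lines.append(current)  # trailing text without an explicit break
    return lines
```

Grouping words this way would give one ocr_line per visual line rather than one per word, provided the response actually carries detectedBreak information.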

I see that gcv2hocr2.py does handle fullTextAnnotation. When I try to run it, this is the output I receive:

python ../gcv2hocr2.py osh-sample-1911a-0001.jpg.json > output.hocr

Traceback (most recent call last):
  File "../gcv2hocr2.py", line 184, in <module>
    page = fromResponse(resp, str(args.gcv_file.rsplit('.',1)[0]), **args.__dict__)
  File "../gcv2hocr2.py", line 103, in fromResponse
    for page_id, page_json in enumerate(resp['fullTextAnnotation']['pages']):
KeyError: 'fullTextAnnotation'

The JSON does contain a fullTextAnnotation object, so I don't know why this error occurs. I'm attaching the JSON I tried to process. If there's a way to get this script to run successfully, I would be very grateful. Thanks again. osh-sample-1911a-0001.jpg.json.zip

sarepal avatar Nov 19 '20 16:11 sarepal
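
A KeyError like the one in the traceback above can occur when the key exists in the file but not at the top level of the parsed document. One way to check is to list every path at which the key appears. This is a minimal diagnostic sketch; find_key_paths is a hypothetical helper, not part of gcv2hocr.

```python
def find_key_paths(node, target, path=""):
    """Return every path in parsed JSON at which `target` occurs as a key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}[{key!r}]"
            if key == target:
                found.append(here)
            # Keep descending: the key may also appear deeper down.
            found.extend(find_key_paths(value, target, here))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_key_paths(value, target, f"{path}[{index}]"))
    return found
```

To use it, load the file with json.load and call find_key_paths(data, "fullTextAnnotation"). A result that starts with ['responses'][0] means the annotation is wrapped in a "responses" list, so resp['fullTextAnnotation'] fails at the top level.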

UPDATE: I now have gcv2hocr2.py working. I changed line 103 to the following, and it worked:

for page_id, page_json in enumerate(resp['responses'][0]['fullTextAnnotation']['pages']):

sarepal avatar Nov 19 '20 17:11 sarepal
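
The one-line change above hard-codes the wrapped layout, so a file with fullTextAnnotation at the top level would then fail instead. A small lookup that accepts both layouts is sketched below. It is an illustration only: get_full_text_annotation is a hypothetical name, and it assumes the wrapper is a "responses" list as in the fix above.

```python
def get_full_text_annotation(resp):
    """Return the fullTextAnnotation dict from either JSON layout (sketch)."""
    # Layout 1: the annotation sits at the top level.
    if "fullTextAnnotation" in resp:
        return resp["fullTextAnnotation"]
    # Layout 2: the annotation is wrapped in a "responses" list;
    # take the first entry that has one.
    for item in resp.get("responses") or []:
        if "fullTextAnnotation" in item:
            return item["fullTextAnnotation"]
    raise KeyError("fullTextAnnotation")
```

With such a helper, the loop on line 103 could iterate over get_full_text_annotation(resp)['pages'] for both kinds of input.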