s3-ocr
Expose difference between HANDWRITING and PRINTED and so on
I just noticed that once you get down to the WORD blocks in the Textract output you see stuff like this:
{
"BlockType": "WORD",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": 99.53694915771484,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 0.006135384552180767,
"Left": 0.5047398209571838,
"Top": 0.8192090392112732,
"Width": 0.01691332273185253
},
"Polygon": [
{
"X": 0.5048662424087524,
"Y": 0.8192090392112732
},
{
"X": 0.5216531157493591,
"Y": 0.8194430470466614
},
{
"X": 0.5215266942977905,
"Y": 0.825344443321228
},
{
"X": 0.5047398209571838,
"Y": 0.8251104354858398
}
]
},
"Hint": null,
"Id": "dc1f6337-125f-46df-aac6-7886c37f93d3",
"Page": 2,
"Query": null,
"Relationships": null,
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": "used,",
"TextType": "PRINTED"
},
{
"BlockType": "WORD",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": 54.271095275878906,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 0.006719755940139294,
"Left": 0.5231310725212097,
"Top": 0.8192643523216248,
"Width": 0.00816002581268549
},
"Polygon": [
{
"X": 0.5232726335525513,
"Y": 0.8192643523216248
},
{
"X": 0.5312910676002502,
"Y": 0.8193761110305786
},
{
"X": 0.5311495065689087,
"Y": 0.8259841203689575
},
{
"X": 0.5231310725212097,
"Y": 0.8258723020553589
}
]
},
"Hint": null,
"Id": "a9b0c6d6-d6c8-4b0e-8203-8864bc40305e",
"Page": 2,
"Query": null,
"Relationships": null,
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": "2/2",
"TextType": "HANDWRITING"
},
The "TextType": "PRINTED" and "TextType": "HANDWRITING" things are cool! This tool currently ignores words and only uses lines, so this information is lost.
Exposing this could fall out of broader work on options for creating a schema that collects data at a higher level of detail.
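For illustration, here is a minimal sketch of how the per-word TextType could be pulled out of a list of Textract blocks. The function name `words_by_text_type` is hypothetical and not part of s3-ocr, the sample blocks are trimmed-down stand-ins for the output above, and the input is assumed to be a list of block dicts shaped like the WORD blocks shown earlier:

```python
def words_by_text_type(blocks):
    """Group the text of WORD blocks by their TextType value.

    `blocks` is assumed to be a list of dicts shaped like the
    Textract blocks quoted above. Non-WORD blocks are skipped.
    """
    grouped = {}
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        # TextType may be missing or null; bucket those separately.
        text_type = block.get("TextType") or "UNKNOWN"
        grouped.setdefault(text_type, []).append(block.get("Text", ""))
    return grouped


# Trimmed-down sample data; the LINE block is made up for illustration.
blocks = [
    {"BlockType": "LINE", "Text": "example line", "TextType": None},
    {"BlockType": "WORD", "Text": "used,", "TextType": "PRINTED"},
    {"BlockType": "WORD", "Text": "2/2", "TextType": "HANDWRITING"},
]

print(words_by_text_type(blocks))
```

Something along these lines could feed an extra column or table if the schema grows to store word-level detail.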