tika-python
Docker Tika-server PDF OCR
Can someone assist? I am trying to get tika-python to return JSON with metadata and text when using the Docker image of Tika. I can get the results I want with curl, but the Python call returns only empty content.
RESULTS for a 2-page non-searchable PDF:
PYTHON:
from tika import parser

# Ask Tika to OCR the rendered pages and to extract inline images
headers = {"X-Tika-PDFextractInlineImages": "true", "X-Tika-PDFocrStrategy": "OCR_ONLY"}
parsed = parser.from_file(
    "sample_notext.pdf",
    serverEndpoint="http://localhost:9998/rmeta",
    headers=headers,
)
print(parsed["content"])
sample_notext.pdf"
CURL:
>curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_ONLY" -H "Accept: application/json" -T sample_notext.pdf localhost:9998/tika | json_pp
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 81740 0 6296 100 75444 1253 15025 0:00:05 0:00:05 --:--:-- 1648
{
"Author" : "rober",
"Content-Type" : "application/pdf",
"Creation-Date" : "2022-02-26T15:38:16Z",
"Last-Modified" : "2022-02-26T15:38:16Z",
"Last-Save-Date" : "2022-02-26T15:38:16Z",
"X-Parsed-By" : [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser",
"class org.apache.tika.parser.ocr.TesseractOCRParser"
],
"X-TIKA:content" : "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:PDFVersion\" content=\"1.7\" />\n<meta name=\"pdf:docinfo:title\" content=\"sample_notext.pdf\" />\n<meta name=\"pdf:hasXFA\" content=\"false\" />\n<meta name=\"access_permission:modify_annotations\" content=\"true\" />\n<meta name=\"access_permission:can_print_degraded\" content=\"true\" />\n<meta name=\"dc:creator\" content=\"rober\" />\n<meta name=\"dcterms:created\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"Last-Modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"dcterms:modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"dc:format\" content=\"application/pdf; version=1.7\" />\n<meta name=\"Last-Save-Date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"access_permission:fill_in_form\" content=\"true\" />\n<meta name=\"pdf:docinfo:modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"meta:save-date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:encrypted\" content=\"false\" />\n<meta name=\"dc:title\" content=\"sample_notext.pdf\" />\n<meta name=\"modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:hasMarkedContent\" content=\"false\" />\n<meta name=\"Content-Type\" content=\"application/pdf\" />\n<meta name=\"pdf:docinfo:creator\" content=\"rober\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.DefaultParser\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.pdf.PDFParser\" />\n<meta name=\"X-Parsed-By\" content=\"class org.apache.tika.parser.ocr.TesseractOCRParser\" />\n<meta name=\"creator\" content=\"rober\" />\n<meta name=\"meta:author\" content=\"rober\" />\n<meta name=\"meta:creation-date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"created\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"access_permission:extract_for_accessibility\" content=\"true\" />\n<meta name=\"access_permission:assemble_document\" 
content=\"true\" />\n<meta name=\"xmpTPg:NPages\" content=\"2\" />\n<meta name=\"Creation-Date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:hasXMP\" content=\"false\" />\n<meta name=\"access_permission:extract_content\" content=\"true\" />\n<meta name=\"access_permission:can_print\" content=\"true\" />\n<meta name=\"Author\" content=\"rober\" />\n<meta name=\"producer\" content=\"Microsoft: Print To PDF\" />\n<meta name=\"access_permission:can_modify\" content=\"true\" />\n<meta name=\"pdf:docinfo:producer\" content=\"Microsoft: Print To PDF\" />\n<meta name=\"pdf:docinfo:created\" content=\"2022-02-26T15:38:16Z\" />\n<title>sample_notext.pdf</title>\n</head>\n<body><div class=\"page\"><div class=\"ocr\">A Simple PDF File\n\nThis is a small demonstration .pdf file -\n\njust for use in the Virtual Mechanics tutorials. More text. And more\ntext. And more text. And more text. And more text.\n\nAnd more text. And more text. And more text. And more text. And more\ntext. And more text. Boring, zzzzz. And more text. And more text. And\nmore text. And more text. And more text. And more text. And more text.\nAnd more text. And more text.\n\nAnd more text. And more text. And more text. And more text. And more\ntext. And more text. And more text. Even more. Continued on page 2...\n</div>\n</div>\n<div class=\"page\"><div class=\"ocr\">simple PDF File 2\n\n...continued from page 1. Yet more text. And more text. And more text.\nAnd more text. And more text. And more text. And more text. And more\ntext. Oh, how boring typing this stuff. But not as boring as watching\npaint dry. And more text. And more text. And more text. And more text.\nBoring. More, a little more text. The end, and just as well.\n</div>\n</div>\n</body></html>",
"access_permission:assemble_document" : "true",
"access_permission:can_modify" : "true",
"access_permission:can_print" : "true",
"access_permission:can_print_degraded" : "true",
"access_permission:extract_content" : "true",
"access_permission:extract_for_accessibility" : "true",
"access_permission:fill_in_form" : "true",
"access_permission:modify_annotations" : "true",
"created" : "2022-02-26T15:38:16Z",
"creator" : "rober",
"date" : "2022-02-26T15:38:16Z",
"dc:creator" : "rober",
"dc:format" : "application/pdf; version=1.7",
"dc:title" : "sample_notext.pdf",
"dcterms:created" : "2022-02-26T15:38:16Z",
"dcterms:modified" : "2022-02-26T15:38:16Z",
"meta:author" : "rober",
"meta:creation-date" : "2022-02-26T15:38:16Z",
"meta:save-date" : "2022-02-26T15:38:16Z",
"modified" : "2022-02-26T15:38:16Z",
"pdf:PDFVersion" : "1.7",
"pdf:charsPerPage" : [
"0",
"0"
],
"pdf:docinfo:created" : "2022-02-26T15:38:16Z",
"pdf:docinfo:creator" : "rober",
"pdf:docinfo:modified" : "2022-02-26T15:38:16Z",
"pdf:docinfo:producer" : "Microsoft: Print To PDF",
"pdf:docinfo:title" : "sample_notext.pdf",
"pdf:encrypted" : "false",
"pdf:hasMarkedContent" : "false",
"pdf:hasXFA" : "false",
"pdf:hasXMP" : "false",
"pdf:unmappedUnicodeCharsPerPage" : [
"0",
"0"
],
"producer" : "Microsoft: Print To PDF",
"title" : "sample_notext.pdf",
"xmpTPg:NPages" : "2"
Update: I have now been able to return both metadata and content in a single JSON dump, but I am still not able to force OCR using the Python CLI or API.
import json

from tika import parser

# Same OCR headers as before; service="all" requests metadata and content together
headers = {"X-Tika-PDFextractInlineImages": "true", "X-Tika-PDFocrStrategy": "OCR_ONLY"}
parsed = parser.from_file(
    "sample_notext.pdf",
    serverEndpoint="http://localhost:9998/rmeta",
    service="all",
    headers=headers,
)
pretty_parsed = json.dumps(parsed, indent=2)
print(pretty_parsed)
Returns:
{
"metadata": {
"Author": "rober",
"Content-Type": "application/pdf",
"Creation-Date": "2022-02-26T15:38:16Z",
"Last-Modified": "2022-02-26T15:38:16Z",
"Last-Save-Date": "2022-02-26T15:38:16Z",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser",
"class org.apache.tika.parser.ocr.TesseractOCRParser"
],
"X-TIKA:EXCEPTION:runtime": "org.apache.commons.io.IOExceptionWithCause: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly\n\tat org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:100)\n\tat org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:963)\n\tat org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)\n\tat org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:66)\n\tat org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)\n\tat org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:147)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:123)\n\tat jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)\n\tat org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)\n\tat org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat 
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)\n\tat org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:500)\n\tat org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)\n\tat org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly\n\tat org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:433)\n\tat org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:97)\n\t... 49 more\n",
"X-TIKA:content_handler": "ToTextContentHandler",
"X-TIKA:embedded_depth": "0",
"X-TIKA:parse_time_millis": "8",
"access_permission:assemble_document": "true",
"access_permission:can_modify": "true",
"access_permission:can_print": "true",
"access_permission:can_print_degraded": "true",
"access_permission:extract_content": "true",
"access_permission:extract_for_accessibility": "true",
"access_permission:fill_in_form": "true",
"access_permission:modify_annotations": "true",
"created": "2022-02-26T15:38:16Z",
"creator": "rober",
"date": "2022-02-26T15:38:16Z",
"dc:creator": "rober",
"dc:format": "application/pdf; version=1.7",
"dc:title": "sample_notext.pdf",
"dcterms:created": "2022-02-26T15:38:16Z",
"dcterms:modified": "2022-02-26T15:38:16Z",
"meta:author": "rober",
"meta:creation-date": "2022-02-26T15:38:16Z",
"meta:save-date": "2022-02-26T15:38:16Z",
"modified": "2022-02-26T15:38:16Z",
"pdf:PDFVersion": "1.7",
"pdf:docinfo:created": "2022-02-26T15:38:16Z",
"pdf:docinfo:creator": "rober",
"pdf:docinfo:modified": "2022-02-26T15:38:16Z",
"pdf:docinfo:producer": "Microsoft: Print To PDF",
"pdf:docinfo:title": "sample_notext.pdf",
"pdf:encrypted": "false",
"pdf:hasMarkedContent": "false",
"pdf:hasXFA": "false",
"pdf:hasXMP": "false",
"producer": "Microsoft: Print To PDF",
"resourceName": "b'sample_notext.pdf'",
"title": "sample_notext.pdf",
"xmpTPg:NPages": "2"
},
"content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nsample_notext.pdf\n\n",
"status": 200
}
Make sure you are using the Tika Docker image that includes OCR (Tesseract). The latest such tag is currently 1.28.2-full.
Also note that if you send X-Tika-PDFextractInlineImages and X-Tika-PDFocrStrategy at the same time, both are applied, which may slow down text extraction.
Note: These two options are independent. If you set extractInlineImages to true and select an OcrStrategy that includes OCR on the rendered page, Tika will run OCR on the extracted inline images and the rendered page.
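Following the note above, a scanned PDF may only need the OCR strategy header, with inline-image extraction left off unless it is wanted. A minimal sketch, using the header names from this thread. `build_pdf_headers` is a hypothetical helper, not part of tika-python, and the set of strategy names is an assumption that should be checked against the Tika version in use.

```python
# Hypothetical helper (not part of tika-python): builds the X-Tika-PDF* request
# headers so inline-image extraction is only enabled when explicitly requested.

# Assumed strategy names; verify them against the Tika version you run.
OCR_STRATEGIES = {"NO_OCR", "OCR_ONLY", "OCR_AND_TEXT_EXTRACTION", "AUTO"}


def build_pdf_headers(ocr_strategy="OCR_ONLY", extract_inline_images=False):
    """Return a headers dict suitable for parser.from_file(..., headers=...)."""
    if ocr_strategy not in OCR_STRATEGIES:
        raise ValueError(f"unknown OCR strategy: {ocr_strategy!r}")
    headers = {"X-Tika-PDFocrStrategy": ocr_strategy}
    if extract_inline_images:
        # Combined with a strategy that OCRs the rendered page, this means
        # OCR runs on the inline images and on the rendered page.
        headers["X-Tika-PDFextractInlineImages"] = "true"
    return headers


# Default: OCR the rendered pages only
print(build_pdf_headers())
```

The returned dict can be passed as the `headers` argument in the `parser.from_file` calls shown earlier.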
@mfernaal is right; I think the problem was the Docker image that was being used. Thanks.
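For anyone hitting the same symptom: the empty `content` in the Python output came with an `X-TIKA:EXCEPTION:runtime` entry in the metadata, which is easy to miss in a long dump. The sketch below surfaces such entries. `find_tika_exceptions` is a hypothetical helper, and it assumes the `metadata` / `content` / `status` result shape shown in this thread.

```python
def find_tika_exceptions(parsed):
    """Return {key: first line of message} for X-TIKA:EXCEPTION:* metadata entries."""
    metadata = parsed.get("metadata") or {}
    found = {}
    for key, value in metadata.items():
        if not key.startswith("X-TIKA:EXCEPTION:"):
            continue
        # A value may be a single string or a list of strings
        if isinstance(value, list):
            value = value[0] if value else ""
        lines = str(value).splitlines()
        # Keep only the first line; the rest is the Java stack trace
        found[key] = lines[0] if lines else ""
    return found


# Shortened copy of the result from this thread, used as sample input
sample = {
    "metadata": {
        "Content-Type": "application/pdf",
        "X-TIKA:EXCEPTION:runtime": (
            "org.apache.commons.io.IOExceptionWithCause: "
            "org.apache.tika.exception.TikaException: Tesseract is not available. "
            "Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly\n"
            "\tat org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:100)\n"
        ),
    },
    "content": "\n\nsample_notext.pdf\n\n",
    "status": 200,
}

for key, message in find_tika_exceptions(sample).items():
    print(f"{key}: {message}")
```

Running a check like this right after `parser.from_file` makes a missing Tesseract installation visible even though the HTTP status is 200.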