Apache Tika error on indexing: Cannot read JPEG2000 image

Open thf-alex opened this issue 1 year ago • 3 comments

Describe the bug I started a new instance of datashare in server mode according to the instructions at https://icij.gitbook.io/datashare/server-mode/about-the-server-mode. After all services had started and I could connect to the web server from my browser, I tried to perform optical character recognition and indexing following the instructions at https://icij.gitbook.io/datashare/server-mode/add-documents-from-the-cli.

The process kicked off by this second command appears to index documents into datashare as expected. However I see many warnings/errors like

Apache Tika: myfilepdf] ERROR PDFStreamEngine - Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

My questions are:

  • Since this error appears frequently, will large portions of document text be missing from the index?
  • Is there a simple way to add the necessary Image I/O tools to the docker image/build process?

Thank you

To Reproduce Please see the description above. Example PDFs causing the error can be provided if needed.

Expected behavior All text in the provided documents is OCR'ed and indexed.

Desktop (please complete the following information):

  • OS: Ubuntu Server 22.04
  • Version: e.g. 11.1.0

Thanks for your help!

thf-alex avatar Jan 18 '24 15:01 thf-alex

Hi @thf-alex, thanks for the detailed report!

It looks like this might be an issue for https://github.com/apache/tika instead, but I might be wrong. I'm surprised the error only appears when performing OCR. Maybe there are images embedded in your PDFs that are in a weird format?

Is there a chance you can share one of the PDF files with us? That would help a lot to sort out what exactly is happening :)

pirhoo avatar Jan 18 '24 16:01 pirhoo

Hi @pirhoo. Yes, I agree it likely has something to do with Tika. Thanks for your quick response!

Here's what I propose: I will try to identify a few of the files for which this error is thrown. Once the index job finishes (we are indexing several thousand docs), I will check whether there is evidence that some text in the docs with errors was not indexed. If we find that to be the case, we will share some of the PDFs with you. If everything seems to be indexed, I will get back to you and let you know that, despite the messages, it appears that everything worked. I anticipate indexing will take several more hours. Thanks again
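
One possible way to narrow down which files trigger the message is to look for PDFs that contain JPEG2000 images, which the PDF format stores in streams using the `JPXDecode` filter. The following is only a rough sketch and is not part of datashare or Tika: the helper names `has_jpx_filter` and `find_jpx_pdfs` are made up for illustration, and the check is a heuristic byte search on the raw file, so it may produce false positives or miss unusual files.

```python
import sys
from pathlib import Path

# PDF name of the filter used for JPEG2000-compressed image streams.
JPX_MARKER = b"/JPXDecode"


def has_jpx_filter(data: bytes) -> bool:
    """Return True if the raw PDF bytes mention the JPXDecode filter.

    Heuristic only: this is a plain byte search, not a real PDF parse.
    """
    return JPX_MARKER in data


def find_jpx_pdfs(root: Path) -> list[Path]:
    """List files ending in .pdf under root whose bytes mention JPXDecode.

    Hypothetical helper; note the pattern is case-sensitive on Linux.
    """
    return sorted(
        path
        for path in root.rglob("*.pdf")
        if path.is_file() and has_jpx_filter(path.read_bytes())
    )


if __name__ == "__main__":
    # Usage: python find_jpx.py /path/to/documents
    for path in find_jpx_pdfs(Path(sys.argv[1])):
        print(path)
```

Files listed by such a scan would be natural candidates to compare against the index, or to share for reproduction.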

thf-alex avatar Jan 18 '24 16:01 thf-alex

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar Feb 28 '24 00:02 github-actions[bot]

This issue was closed because it has been inactive for 20 days since being marked as stale.

github-actions[bot] avatar Mar 19 '24 00:03 github-actions[bot]