ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

Open markusgjanssen opened this issue 1 year ago • 7 comments

Hi,

Describe the bug

When reading a PDF file with images, this error occurs:

ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

When calling the REST endpoint with a PDF containing an image, I got the following result:

<html><head><title>Grizzly 3.0.0</title><style><!--div.header {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#003300;font-size:22px;-moz-border-radius-topleft: 10px;border-top-left-radius: 10px;-moz-border-radius-topright: 10px;border-top-right-radius: 10px;padding-left: 5px}div.body {font-family:Tahoma,Arial,sans-serif;color:black;background-color:#FFFFCC;font-size:16px;padding-top:10px;padding-bottom:10px;padding-left:10px}div.footer {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#666633;font-size:14px;-moz-border-radius-bottomleft: 10px;border-bottom-left-radius: 10px;-moz-border-radius-bottomright: 10px;border-bottom-right-radius: 10px;padding-left: 5px}BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}B {font-family:Tahoma,Arial,sans-serif;color:black;}A {color : black;}HR {color : #999966;}--></style> </head><body><div class="header">Request failed.</div><div class="body">Request failed.</div><div class="footer">Grizzly 3.0.0</div></body></html>

I guess this is the same issue, but the error does not appear in the log file.

I start the Docker container with the following command:

sudo docker run -it --rm -p 8080:8080 -h <ip address> -v ~/.fscrawler:/root/.fscrawler -v ~/tmp:/tmp/es:ro dadoonet/fscrawler fscrawler documents --rest

Job Settings

The _settings.yaml (I played with indexed_chars, but that did not help):

name: "documents"
fs:
  url: "/tmp/es"
  update_rate: "2m"
  excludes:
  - "*/~*"
  json_support: false
  indexed_chars: "-1"
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: true
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "https://<url to elastic cloud>"
  username: "elastic"
  password: "<password of elastic>"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
rest:
  url: "http://0.0.0.0:8080/fscrawler"
  enable_cors: true

Logs

09:58:57,762 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [2.6mb/233.9mb=1.15%], RAM [296.5mb/964.5mb=30.75%], Swap [0b/0b=0.0].
09:58:58,406 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
09:58:58,407 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
09:58:59,988 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.2.0
09:59:00,171 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.2.0
09:59:00,225 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [documents] for [/tmp/es] every [2m]
09:59:00,461 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
09:59:00,647 WARN  [o.g.j.s.w.WadlFeature] JAXBContext implementation could not be found. WADL feature is disabled.
09:59:00,934 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi will be ignored.
09:59:00,937 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi will be ignored.
09:59:00,938 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi will be ignored.
09:59:01,710 INFO  [f.p.e.c.f.r.RestServer] FS crawler Rest service started on [http://0.0.0.0:8080/fscrawler]
09:59:03,326 ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
09:59:13,255 ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

Expected behavior

Read the PDF and index its text in Elasticsearch.

Versions:

  • OS: AWS Linux
  • Version 2.10-SNAPSHOT Docker

Attachment

If the bug is related to a given file, please share this file so we can reuse it in tests to reproduce the problem and maybe use it in our integration tests.

markusgjanssen, Jul 12 '22 10:07

I think it's because I made the libraries optional. They should be part of the distribution, I guess.

I need to check this more.

dadoonet, Aug 22 '22 15:08

We were able to work around this by adding a lib that brings in the JAI API/libs.

In the distribution/pom.xml file, we simply modified the langsPkg argument to look like the following:

<docker.ocr.args.langsPkg>imagemagick tesseract-ocr tesseract-ocr-all libpixelmed-imageio-java</docker.ocr.args.langsPkg>
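
To check whether a change like this actually makes a JPEG2000 reader visible to the JVM, one option is to ask `javax.imageio.ImageIO` directly. The following is a minimal sketch, not something taken from this thread: the class name `Jpeg2000ReaderCheck` is hypothetical, and the format name `"JPEG2000"` is an assumption about what PDFBox looks up when it logs this error. It only reports what is registered on the classpath it runs with, so it would have to run with the same classpath as FSCrawler to be meaningful.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper: reports whether an ImageIO reader is registered
// for a given format name on the current classpath.
class Jpeg2000ReaderCheck {

    // Returns true if at least one ImageReader claims the given format name.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Pick up any ImageIO plugins that were added to the classpath.
        ImageIO.scanForPlugins();

        // "JPEG2000" is an assumed format name; adjust it if your plugin
        // registers under a different name.
        String format = args.length > 0 ? args[0] : "JPEG2000";
        if (hasReader(format)) {
            System.out.println("Reader found for " + format);
        } else {
            System.out.println("No reader registered for " + format);
        }
    }
}
```

If no reader is reported, the library was probably not picked up by the JVM, and the PDFBox error above would be expected to persist.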

24601, Oct 06 '22 22:10

@24601 Would you like to contribute this change as a new PR?

dadoonet, Oct 07 '22 03:10

@dadoonet - I could and would be happy to, but I believe the JPEG2000 libs have a license issue (I think you may have written about it in the docs?), and I would not want to submit a PR that may risk license issues and/or expose the project to certain liabilities. This concerns both the JPEG2000 spec itself and Oracle's various IP around the Java API, which they have been really nasty about before. Do you have a position on this?

I would be happy to submit a PR that documents this or provides a convenience script, rather than including the library automatically, and that shows the user a warning they have to acknowledge. I'm not a lawyer, but this is perhaps the best course of action IMO.

I am also vetting the TwelveMonkeys library, which is a JAI implementation with both TIFF and JPEG Lossless, as a drop-in replacement. However, it does not specifically list JPEG2000 support, only JPEG Lossless, which is part of the 2000 spec but not the whole spec, so it may not pan out. Testing now.

Information: from the JJ2000 license:

Those intending to use this software module in hardware or software products are advised that their use may infringe existing patents. The original developers of this software module, JJ2000 Partners and ISO/IEC assume no liability for use of this software module or modifications thereof. No license or right to this software module is granted for non JPEG 2000 Standard conforming products.

24601, Oct 07 '22 21:10

I think we can provide another Docker image but under the original lib license. So users will choose if they can use it or not. WDYT?

  • default images under Apache2
  • JPEG images under the new license

I'm not totally happy with this as this will generate even more images...

dadoonet, Oct 07 '22 21:10

That sounds good. I will vet which library seems to be the best (I have tested my solution above for 24 hrs but am still seeing a few edge cases on other formats) and submit a PR, hopefully early next week.

Merci!

24601, Oct 07 '22 21:10

An update on this, @dadoonet - I have a fix where the error is now just a warning and it works a bit better, but there are still some cases with JPX images in PDFs that I am trying to address with proper libraries.

24601, Oct 12 '22 22:10