fscrawler
ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Hi,
Describe the bug
When reading a PDF file that contains images, this error occurs:
ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
When calling the REST endpoint with a PDF that contains an image, I got the following result:
<html><head><title>Grizzly 3.0.0</title><style><!--div.header {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#003300;font-size:22px;-moz-border-radius-topleft: 10px;border-top-left-radius: 10px;-moz-border-radius-topright: 10px;border-top-right-radius: 10px;padding-left: 5px}div.body {font-family:Tahoma,Arial,sans-serif;color:black;background-color:#FFFFCC;font-size:16px;padding-top:10px;padding-bottom:10px;padding-left:10px}div.footer {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#666633;font-size:14px;-moz-border-radius-bottomleft: 10px;border-bottom-left-radius: 10px;-moz-border-radius-bottomright: 10px;border-bottom-right-radius: 10px;padding-left: 5px}BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}B {font-family:Tahoma,Arial,sans-serif;color:black;}A {color : black;}HR {color : #999966;}--></style> </head><body><div class="header">Request failed.</div><div class="body">Request failed.</div><div class="footer">Grizzly 3.0.0</div></body></html>
I guess this is the same issue, but the error does not appear in the log file.
I start the Docker container with the following command:
sudo docker run -it --rm -p 8080:8080 -h <ip address> -v ~/.fscrawler:/root/.fscrawler -v ~/tmp:/tmp/es:ro dadoonet/fscrawler fscrawler documents --rest
Job Settings
The _settings.yaml (I played with indexed_chars, but that did not help):
name: "documents"
fs:
  url: "/tmp/es"
  update_rate: "2m"
  excludes:
  - "*/~*"
  json_support: false
  indexed_chars: "-1"
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: true
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "https://<url to elastic cloud>"
  username: "elastic"
  password: "<password of elastic>"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
rest:
  url: "http://0.0.0.0:8080/fscrawler"
  enable_cors: true
Logs
09:58:57,762 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [2.6mb/233.9mb=1.15%], RAM [296.5mb/964.5mb=30.75%], Swap [0b/0b=0.0].
09:58:58,406 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
09:58:58,407 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
09:58:59,988 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.2.0
09:59:00,171 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.2.0
09:59:00,225 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [documents] for [/tmp/es] every [2m]
09:59:00,461 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
09:59:00,647 WARN [o.g.j.s.w.WadlFeature] JAXBContext implementation could not be found. WADL feature is disabled.
09:59:00,934 WARN [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi will be ignored.
09:59:00,937 WARN [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi will be ignored.
09:59:00,938 WARN [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi will be ignored.
09:59:01,710 INFO [f.p.e.c.f.r.RestServer] FS crawler Rest service started on [http://0.0.0.0:8080/fscrawler]
09:59:03,326 ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
09:59:13,255 ERROR [o.a.p.c.PDFStreamEngine] Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Expected behavior
Read the PDF and put the text in Elastic
Versions:
- OS: AWS Linux
- Version 2.10-SNAPSHOT Docker
Attachment
If the bug is related to a given file, please share this file so we can reuse it in tests to reproduce the problem and may be use it in our integration tests.
I think it's because I made the libraries optional. They should be part of the distribution, I guess.
I need to check this more.
We were able to work around this by adding a package that brings in the JAI API and libs.
In the distribution/pom.xml file we simply modified the langsPkg argument to look like the following:
<docker.ocr.args.langsPkg>imagemagick tesseract-ocr tesseract-ocr-all libpixelmed-imageio-java</docker.ocr.args.langsPkg>
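To check whether a change like this actually makes a JPEG2000 reader visible to the JVM, a small probe along these lines may help. It is only a sketch: the class name ImageReaderCheck and the helper hasReader are made up for illustration, and it assumes that PDFBox looks the reader up through the standard javax.imageio registry under the format name "JPEG2000". Run it with the same classpath as the crawler.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.Iterator;

// Hypothetical helper, not part of FSCrawler or PDFBox.
public class ImageReaderCheck {

    // Returns true if the ImageIO registry has at least one reader
    // registered for the given format name.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Pick up any ImageIO plugins that are on the classpath.
        ImageIO.scanForPlugins();
        // "png" ships with the JDK and serves as a sanity check;
        // "JPEG2000" is the format name assumed to be used by PDFBox.
        for (String format : new String[] {"png", "JPEG2000"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
    }
}
```

If the JPEG2000 line still prints false, the extra package is probably not on the classpath that the crawler uses.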
@24601 Would you like to contribute this change as a new PR?
@dadoonet - I could and would be happy to, but I believe the JPEG2000 libs have a license issue (I think you may have written about it in the docs?), and I would not want to submit a PR that risks license issues or exposes the project to liability. My concern covers both the JPEG2000 spec itself and Oracle's various IP around the Java API, which they have been really nasty about before. Do you have a position on this?
I would be happy to submit a PR that documents this or provides a convenience script, rather than including the library automatically. It would show a warning to the user and ask for an acknowledgement. I'm not a lawyer, but this is perhaps the best course of action IMO.
I am also vetting the TwelveMonkeys library, which is a JAI implementation with both TIFF and JPEG Lossless support, as a drop-in replacement. It does not specifically list JPEG2000 support, only JPEG Lossless, which is part of the 2000 spec but not the whole spec, so it may not pan out. Testing now.
Information: from the JJ2000 license:
Those intending to use this software module in hardware or software products are advised that their use may infringe existing patents. The original developers of this software module, JJ2000 Partners and ISO/IEC assume no liability for use of this software module or modifications thereof. No license or right to this software module is granted for non JPEG 2000 Standard conforming products.
I think we can provide another Docker image but under the original lib license. So users will choose if they can use it or not. WDYT?
- default images under Apache2
- JPEG images under the new license
I'm not totally happy with this as this will generate even more images...
That sounds good, I will vet which library seems to be the best (I have tested my solution above for 24 hrs but still seeing a few edge cases on other formats) and submit a PR hopefully early next week.
Merci!
An update on this, @dadoonet - I have a fix where the error is now just a warning and it works a bit better, but there are still some cases with JPX images in PDFs that I am trying to address with proper libraries.