fulltextsearch
java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content] with large pdf files
php occ fulltextsearch:index
stops indexing at pdf files with
Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException │ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
I deleted the first PDF where indexing stopped and started the indexing command again. Indexing stalled again on another PDF file, and again after I deleted that one too.
Common pattern: all of these PDF files were larger than 70 MB.
Elasticsearch is running with 8 GB of RAM:
Active: active (running) since Mon 2018-10-08 16:14:13 CEST; 1h 10min ago
Docs: http://www.elastic.co
Main PID: 504 (java)
CGroup: /system.slice/elasticsearch.service
|-504 /bin/java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava...
`-807 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
Latest apps installed (1.01) and configured.
Any hints on this? I would love to use fulltextsearch on my files...
Just for the record: I am running NC 14
$ php occ status
- installed: true
- version: 14.0.1.1
- versionstring: 14.0.1
- edition:
Have you run a test before the first index?
Yes I did:
$ php occ fulltextsearch:test
.Testing your current setup:
Creating mocked content provider. ok
Testing mocked provider: get indexable documents. (2 items) ok
Loading search platform. (Elasticsearch) ok
Testing search platform. ok
Locking process ok
Removing test. ok
Pausing 3 seconds 1 2 3 ok
Initializing index mapping. ok
Indexing generated documents. ok
Pausing 3 seconds 1 2 3 ok
Retreiving content from a big index (license). (size: 32386) ok
Comparing document with source. ok
Searching basic keywords:
- 'test' (result: 1, expected: ["simple"]) ok
- 'document is a simple test' (result: 2, expected: ["simple","license"]) ok
- '"document is a test"' (result: 0, expected: []) ok
- '"document is a simple test"' (result: 1, expected: ["simple"]) ok
- 'document is a simple -test' (result: 1, expected: ["license"]) ok
- 'document is a simple +test' (result: 1, expected: ["simple"]) ok
- '-document is a simple test' (result: 0, expected: []) ok
Updating documents access. ok
Pausing 3 seconds 1 2 3 ok
Searching with group access rights:
- 'license' - [] - (result: 0, expected: []) ok
- 'license' - ["group_1"] - (result: 1, expected: ["license"]) ok
- 'license' - ["group_1","group_2"] - (result: 1, expected: ["license"]) ok
- 'license' - ["group_3","group_2"] - (result: 1, expected: ["license"]) ok
- 'license' - ["group_3"] - (result: 0, expected: []) ok
Searching with share rights:
- 'license' - notuser - (result: 0, expected: []) ok
- 'license' - user2 - (result: 1, expected: ["license"]) ok
- 'license' - user3 - (result: 1, expected: ["license"]) ok
Removing test. ok
Unlocking process ok
Can you reset, test and re-index?
./occ fulltextsearch:reset
./occ fulltextsearch:test
./occ fulltextsearch:index
I also did some tests on my side, and Elasticsearch returns a 'Request Entity Too Large' error. I would say that PDF files bigger than ~70 MB will not be indexed by Elasticsearch.
Could you send me the PDF that crashes your index?
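A rough way to see where a ~70 MB ceiling could come from is the arithmetic below. It is only a sketch under two assumptions that are not confirmed in this thread: that the file content is sent to Elasticsearch base64-encoded (as the ingest attachment processor expects), and that the server still uses the default `http.max_content_length` of 100mb. The function names are made up for illustration.

```python
# Assumption: Elasticsearch's default http.max_content_length (100mb).
DEFAULT_MAX_CONTENT_LENGTH = 100 * 1024 * 1024


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def max_raw_size(limit_bytes: int = DEFAULT_MAX_CONTENT_LENGTH) -> int:
    """Largest raw size whose base64 form still fits in limit_bytes."""
    return (limit_bytes // 4) * 3


if __name__ == "__main__":
    mib = 1024 * 1024
    for size_mib in (50, 70, 75, 80, 170):
        encoded = base64_size(size_mib * mib)
        verdict = "fits" if encoded <= DEFAULT_MAX_CONTENT_LENGTH else "too large"
        print(f"{size_mib:>4} MiB file -> {encoded / mib:6.1f} MiB encoded: {verdict}")
```

Under these assumptions the cut-off lands at 75 MiB of raw data, and slightly lower in practice because the JSON request around the encoded content also counts toward the limit. It says nothing about failures on much smaller files.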
@reset | test | index: I did this before twice, no luck.
@PDF: Sure, please check your private mail.
@tomthecat using '@'s is how you contact people on Github, please be a little careful with them.
@tomthecat I haven't received any email. If you host the file, can you send me the link at [email protected]?
daita: Did you receive my mail?
yup, if we're talking about a 170MB pdf ? :-)
What do you think: is there a chance to make fulltextsearch skip these files instead of aborting?
should be fixed in 1.0.2
Updated to 1.0.2. It seems a bit better now, but I still receive the
Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException │ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]
error on very large PDF files and also on a large PPT. See PM for a link to these files.
The error I get is a BadRequest400Exception from Elasticsearch, which is typical for PDF files bigger than 70 MB.
Have you changed anything in the configuration of your ES?
Nope. I followed the instructions given here: https://fribeiro.org/tech/2018/02/07/nextcloud-full-text-elasticsearch/ and here: https://decatec.de/home-server/volltextsuche-in-nextcloud-mit-ocr/ (for tesseract OCR) without any additional tweaking.
I get the same error with a PDF of ~3 MB but around 130 pages. I can also send you the pdf if needed.
I have the same error. Is it possible to configure fulltextsearch to skip these PDFs? We have a lot of PDFs bigger than 70 MB...
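For anyone who wants to know in advance which files are likely to trip the indexer, here is a minimal sketch that lists large documents under a directory. It does not touch fulltextsearch or Elasticsearch at all; the function name, the 70 MiB threshold (taken from the sizes reported above) and the suffix list are assumptions chosen for illustration.

```python
import os
import sys
from typing import Iterable, List, Tuple


def find_large_files(
    root: str,
    limit_bytes: int,
    suffixes: Iterable[str] = (".pdf", ".ppt"),
) -> List[Tuple[str, int]]:
    """Return (path, size) for files under root larger than limit_bytes,
    biggest first. Only files with one of the given suffixes are considered."""
    wanted = tuple(s.lower() for s in suffixes)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(wanted):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Unreadable or vanished file: skip it rather than abort.
                continue
            if size > limit_bytes:
                hits.append((path, size))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits


if __name__ == "__main__":
    # Hypothetical usage: pass the directory to scan as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_large_files(root, 70 * 1024 * 1024):
        print(f"{size / (1024 * 1024):8.1f} MiB  {path}")
```

Pointing it at a user's files directory gives a list of candidates to check against the points where the index run stops.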