fulltextsearch java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content] with large pdf files

Open tomthecat opened this issue 5 years ago • 17 comments

php occ fulltextsearch:index stops indexing at pdf files with

Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException │ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]

I deleted the first PDF where indexing stopped and started the indexing command again. Indexing stalled again on another PDF file, and again after I deleted that one too.

Common pattern: all of the affected PDF files were larger than 70 MB.
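To check whether the size pattern holds across a whole data directory, a short script can list the PDF files above a given size. This is only a sketch: the 70 MB threshold comes from the observation above, while the function name `find_large_pdfs` and the way it is called are assumptions for illustration and not part of fulltextsearch.

```python
from pathlib import Path

# Threshold taken from the observation in this report (70 MB); adjust as needed.
SIZE_LIMIT_BYTES = 70 * 1024 * 1024


def find_large_pdfs(root, limit=SIZE_LIMIT_BYTES):
    """Return (path, size) pairs for PDF files under `root` larger than `limit` bytes."""
    hits = []
    for path in Path(root).rglob("*"):
        # Match the extension case-insensitively so that .PDF is found as well.
        if path.is_file() and path.suffix.lower() == ".pdf":
            size = path.stat().st_size
            if size > limit:
                hits.append((path, size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `find_large_pdfs("/path/to/data")` with the real data directory (the path here is a placeholder) returns the candidates that could be moved aside before running the index command.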

Elasticsearch is running with 8 GB of RAM:

Active: active (running) since Mon 2018-10-08 16:14:13 CEST; 1h 10min ago
Docs: http://www.elastic.co
Main PID: 504 (java)
CGroup: /system.slice/elasticsearch.service
|-504 /bin/java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava...
`-807 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

The latest versions of the apps (1.0.1) are installed and configured.

Any hints on this? I would love to use fulltextsearch on my files...

Oct 08 '18 15:10 tomthecat

Just for the record: I am running NC 14

$ php occ status

installed: true
version: 14.0.1.1
versionstring: 14.0.1
edition:

Oct 11 '18 07:10 tomthecat

Have you run the test before the first index?

Oct 11 '18 08:10 ArtificialOwl

Yes I did:

$ php occ fulltextsearch:test

.Testing your current setup:
Creating mocked content provider. ok
Testing mocked provider: get indexable documents. (2 items) ok
Loading search platform. (Elasticsearch) ok
Testing search platform. ok
Locking process ok
Removing test. ok
Pausing 3 seconds 1 2 3 ok
Initializing index mapping. ok
Indexing generated documents. ok
Pausing 3 seconds 1 2 3 ok
Retreiving content from a big index (license). (size: 32386) ok
Comparing document with source. ok
Searching basic keywords:

'test' (result: 1, expected: ["simple"]) ok
'document is a simple test' (result: 2, expected: ["simple","license"]) ok
'"document is a test"' (result: 0, expected: []) ok
'"document is a simple test"' (result: 1, expected: ["simple"]) ok
'document is a simple -test' (result: 1, expected: ["license"]) ok
'document is a simple +test' (result: 1, expected: ["simple"]) ok
'-document is a simple test' (result: 0, expected: []) ok
Updating documents access. ok
Pausing 3 seconds 1 2 3 ok
Searching with group access rights:
'license' - [] - (result: 0, expected: []) ok
'license' - ["group_1"] - (result: 1, expected: ["license"]) ok
'license' - ["group_1","group_2"] - (result: 1, expected: ["license"]) ok
'license' - ["group_3","group_2"] - (result: 1, expected: ["license"]) ok
'license' - ["group_3"] - (result: 0, expected: []) ok
Searching with share rights:
'license' - notuser - (result: 0, expected: []) ok
'license' - user2 - (result: 1, expected: ["license"]) ok
'license' - user3 - (result: 1, expected: ["license"]) ok
Removing test. ok
Unlocking process ok

Oct 11 '18 08:10 tomthecat

Can you reset, test and re-index?

./occ fulltextsearch:reset
./occ fulltextsearch:test
./occ fulltextsearch:index

Oct 11 '18 12:10 ArtificialOwl

I also ran some tests on my side, and Elasticsearch returns a 'Request Entity Too Large' error. I would say that PDF files bigger than ~70 MB will not be indexed by Elasticsearch.
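One possible explanation for a threshold in this range, offered as an assumption rather than something confirmed in this thread: if the file content is sent to Elasticsearch base64-encoded inside the request body (which is the input format the ingest attachment processor expects), the payload grows by about one third, and Elasticsearch rejects HTTP bodies larger than `http.max_content_length`, which defaults to 100mb. The sketch below only does that arithmetic; the function names are made up for illustration.

```python
# Elasticsearch's documented default for http.max_content_length is 100mb.
DEFAULT_MAX_CONTENT_LENGTH = 100 * 1024 * 1024
MB = 1024 * 1024


def base64_size(raw_bytes):
    """Length in bytes of the padded base64 encoding of `raw_bytes` bytes of input."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


def exceeds_limit(raw_bytes, limit=DEFAULT_MAX_CONTENT_LENGTH):
    """True if the base64 payload alone (ignoring JSON overhead) is larger than `limit`."""
    return base64_size(raw_bytes) > limit


# Compare a few file sizes against the default limit.
for size_mb in (60, 70, 80, 170):
    encoded = base64_size(size_mb * MB)
    print(f"{size_mb:4d} MB file -> {encoded / MB:6.1f} MB encoded, "
          f"over limit: {exceeds_limit(size_mb * MB)}")
```

Under these assumptions a 70 MB file encodes to roughly 93 MB, which is still under the default limit, while an 80 MB file is already over it. Other fields in the request push the practical cut-off somewhat lower than the base64 arithmetic alone suggests.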

Could you send me the PDF that crashes your index?

Oct 11 '18 13:10 ArtificialOwl

@reset | test | index: I did this before twice, no luck.

@PDF: Sure, please check your private mail.

Oct 11 '18 15:10 tomthecat

@tomthecat using '@'s is how you contact people on Github, please be a little careful with them.

Oct 11 '18 21:10 pdf

@tomthecat I haven't received any email. If you host the file, can you send me the link at [email protected]?

Oct 12 '18 06:10 ArtificialOwl

daita: Did you receive my mail?

Oct 15 '18 19:10 tomthecat

yup, if we're talking about a 170MB pdf ? :-)

Oct 16 '18 11:10 ArtificialOwl

What do you think: is there a chance to make fulltextsearch skip these files instead of aborting?

Oct 16 '18 12:10 tomthecat

should be fixed in 1.0.2

Oct 19 '18 06:10 ArtificialOwl

Updated to 1.0.2. Seems a bit better now, but: I still receive the

Exception: Elasticsearch\Common\Exceptions\ServerErrorResponseException │ Message: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [attachment.content]

error on very large PDF files and also on a large PPT. See PM for a link to these files.

Oct 22 '18 08:10 tomthecat

The error I get is a BadRequest400Exception from Elasticsearch, which is typical for PDF files bigger than 70 MB.

Have you changed anything in the configuration of your ES?

Oct 23 '18 12:10 ArtificialOwl

Nope. I followed the instructions given here: https://fribeiro.org/tech/2018/02/07/nextcloud-full-text-elasticsearch/ and here: https://decatec.de/home-server/volltextsuche-in-nextcloud-mit-ocr/ (for tesseract OCR) without any additional tweaking.

Oct 24 '18 07:10 tomthecat

I get the same error with a PDF of ~3 MB but around 130 pages. I can also send you the pdf if needed.
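A 3 MB file failing suggests that size is not the only trigger. The message itself is what an Elasticsearch ingest pipeline reports when a processor reads a field that does not exist, here `attachment.content`. One situation where that can happen, stated as an assumption rather than a diagnosis of this report: the attachment processor extracts no text (for example from a scanned PDF without a text layer), so `attachment.content` is never set, and a later processor in the same pipeline then tries to read it. The sketch below builds a hypothetical pipeline definition as a Python dict. The processor layout and the field names `data` and `content` are illustrative and not taken from the app's source; `ignore_missing` is a documented option of the `convert` and `remove` processors in recent Elasticsearch releases, so check the documentation for the installed version.

```python
import json


def build_pipeline(ignore_missing=True):
    """Build a hypothetical ingest pipeline body that tolerates a missing attachment.content."""
    return {
        "description": "attachment extraction (illustrative sketch)",
        "processors": [
            # Extract text from the base64-encoded source field into attachment.*
            {"attachment": {"field": "data"}},
            # Copy the extracted text to a top-level field; skipped if nothing was extracted.
            {
                "convert": {
                    "field": "attachment.content",
                    "type": "string",
                    "target_field": "content",
                    "ignore_missing": ignore_missing,
                }
            },
            # Drop the nested copy; skipped as well if it does not exist.
            {
                "remove": {
                    "field": "attachment.content",
                    "ignore_missing": ignore_missing,
                }
            },
        ],
    }


# Print the JSON body that would be sent when creating the pipeline.
print(json.dumps(build_pipeline(), indent=2))
```

With `ignore_missing` set, a document without extractable text would pass through such a pipeline without a `content` field instead of aborting the whole indexing run.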

Aug 24 '19 07:08 Agraphie

I have the same error. Is it possible to configure fulltextsearch to skip these PDFs? We have a lot of PDFs bigger than 70 MB...

Feb 19 '20 07:02 aiveras
