files_fulltextsearch
Exception FilesService calling `$document->getContent()` while indexing a Document
I have a new Debian 10 installation with Nextcloud 20.0.3.
Running occ fulltextsearch:index
leads to an unhandled exception:
An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php:719
Stack trace:
#0 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(658): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(638): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(529): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(268): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /var/www/html/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(317): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /var/www/html/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(204): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /var/www/html/nextcloud/apps/fulltextsearch/lib/Command/Index.php(410): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'Andy', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /var/www/html/nextcloud/apps/fulltextsearch/lib/Command/Index.php(273): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /var/www/html/nextcloud/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /var/www/html/nextcloud/core/Command/Base.php(169): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /var/www/html/nextcloud/3rdparty/symfony/console/Application.php(1000): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/nextcloud/3rdparty/symfony/console/Application.php(271): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/nextcloud/3rdparty/symfony/console/Application.php(147): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/nextcloud/lib/private/Console/Application.php(215): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/nextcloud/console.php(100): OC\Console\Application->run()
#15 /var/www/html/nextcloud/occ(11): require_once('/var/www/html/n...')
Is there any way to fix this error?
Also happens with an existing, working 19.0.6 install updated to 20.0.3.
I tried to debug it a bit, though I have no clue about PHP...
It seems that the variable $document loses the object it was containing and gets replaced with the string *** sensitive parameter replaced ***. I think the only place where this can happen is somewhere here, because you pass the document with &$document.
With further investigation, I think that an exception is thrown in this line, where $e->getMessage() shows:
FailedToExecuteCommand `'gs' -sstdout=%stderr -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 '-sDEVICE=pngalpha' -dTextAlphaBits=4 -dGraphicsAlphaBits=4 '-r72x72' '-sOutputFile=/tmp/magick-1272R2-pXVaqdBDj%d' '-f/tmp/magick-1272u4NCaxRdtnmc' '-f/tmp/magick-12721ZKLppFvcg54'' (1) @ error/pdf.c/InvokePDFDelegate/291
The code in catch passes a $document, and my guess is that this "replaces" the original $document with the string *** sensitive parameter replaced ***.
This seems to work (for me) if you install ghostscript. Is this a new dependency?
apt install ghostscript
:D
I have ghostscript:9.27~dfsg-2+deb10u4 installed in the nextcloud image and /usr/bin/gs works and is in the path, but it continues to fail in the same way.
However, unchecking "Enable OCR" in Settings -> Files - Tesseract OCR does allow indexing to proceed without errors. (though obviously also without OCR)
Is there perhaps another new dependency for Tesseract, other than gs (which is installed)?
@epvuc, you could add
file_put_contents("/tmp/mydebug.log", "caught exception: " . $e->getMessage() . "\n" , FILE_APPEND);
in the catch block in the file custom_apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php (around line 251); maybe that lets you figure it out yourself.
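Once that debug line is in place, the logged messages can be sorted into rough groups. The sketch below is only an illustration: the function name `classify_fts_error` and the group labels are made up here, and the three patterns are simply the exception texts quoted in this thread, not an exhaustive list of what ImageMagick can report.

```shell
#!/bin/sh
# Hypothetical helper: map one logged exception message to a rough cause.
# The patterns are the message fragments quoted in this thread.
classify_fts_error() {
    case $1 in
        *"not allowed by the security policy"*)
            # ImageMagick refused the coder because of policy.xml
            echo "policy" ;;
        *FailedToExecuteCommand*)
            # ImageMagick could not run its delegate command (here: gs)
            echo "delegate" ;;
        *"Failed to read the file"*)
            # The input file itself could not be read or parsed
            echo "unreadable-file" ;;
        *)
            echo "unknown" ;;
    esac
}
```

For example, `classify_fts_error "$(tail -n 1 /tmp/mydebug.log)"` would label the most recent log entry.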
Please confirm whether you still have this issue with the latest versions of files_fulltextsearch and files_fulltextsearch_tesseract.
@daita do you mean without having ghostscript installed? Or are you asking @epvuc for a confirmation?
I have files_fulltextsearch=20.0.0 and files_fulltextsearch_tesseract=20.0.1 which seem to be the latest versions and are the same as what's in git.
@ufobat I have already installed Ghostscript. This didn't change anything. I added the additional logging and it gave me this info:
caught exception: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408
@daita I am on Nextcloud 20.0.4 with the following FTS versions, and the problem still exists:
- Full text search 20.0.0
- Full text search - Elasticsearch Platform 20.0.0
- Full text search - Files 20.0.0
- Full text search - Files - Tesseract OCR 20.0.1
@ufobat With the additional logging I found a solution that worked for me. I had to add an additional policy for ImageMagick as mentioned here: https://stackoverflow.com/a/53180170/1254045
In my case the policy file was /etc/ImageMagick-7/policy.xml, where I added <policy domain="coder" rights="read | write" pattern="PDF" />.
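To see what the policy file currently says about PDF before changing it, something like the following could be used. It is a rough sketch with a hypothetical helper name (`pdf_policy_rights`); it assumes each rule sits on a single line with `pattern="PDF"` written exactly like that, and it only skips comments that open on the same line.

```shell
#!/bin/sh
# Hypothetical helper: print the rights value of every PDF coder rule
# found in an ImageMagick policy file, one per line.
# Prints "unset" when no such rule is present.
pdf_policy_rights() {
    policy_file=$1
    # Drop lines that open an XML comment, keep coder rules for PDF,
    # then extract the text between rights=" and the closing quote.
    rights=$(grep -v '<!--' "$policy_file" \
        | grep 'domain="coder"' \
        | grep 'pattern="PDF"' \
        | sed -n 's/.*rights="\([^"]*\)".*/\1/p') || true
    if [ -z "$rights" ]; then
        echo "unset"
    else
        printf '%s\n' "$rights"
    fi
}
```

Called as `pdf_policy_rights /etc/ImageMagick-7/policy.xml`, an output of `none` would be consistent with the security policy exception quoted above.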
This ImageMagick policy change allows indexing to work again for me, as well.
I think it will be important to know what security flaw in (presumably) ghostscript led to this being disallowed in ImageMagick's policy and in what gs version it was fixed, before making this change, though.
Uh, I have had to have this policy active ever since I started to work with the tesseract fulltextsearch.
People deploying Nextcloud via the official Docker container on Docker Hub will get a fresh copy of this file with every update, inherited from the ImageMagick package used to build the container.
I wonder if those experiencing this problem are all using the official Docker Hub container. If so, it may be that the container builder was updated to use a new version of the ImageMagick package which included the policy change we've observed. In this case the maintainer of the Nextcloud Docker image should be notified.
-- edit -- actually no, the Docker image template doesn't install ImageMagick at all. It looks like that's up to the end user, so the policy change would be too, unless someone wanted to add it to the official container.
so, looks like it is working with ghostscript and imagick with right policies ?
For me it is, yes. Only change might be to have the indexing code gracefully handle the situation where ImageMagick returns "caught exception: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408" and perhaps surface a useful error, as this will be up to the packager/user to handle.
Handling this policy refusal is also important because we could argue it's really not safe to feed arbitrary PDF files to ghostscript via ImageMagick at all, and a user might choose not to allow this. In this case we would want fulltextsearch-files-tesseract to still work for plain image formats and not break indexing entirely.
so, looks like it is working with ghostscript and imagick with right policies ?
In my case yes.
For me it is, yes. Only change might be to have the indexing code gracefully handle the situation where ImageMagick returns "caught exception: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408" and perhaps surface a useful error, as this will be up to the packager/user to handle.
That would be very helpful. Either with a possibility to fix it or a short description how to resolve this problem.
Hi all,
I'm fighting with the same error, but when I added the debug line from @ufobat, I get:
caught exception: Failed to read the file
Maybe the full log helps as well:
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.
No pages will be processed (FirstPage > LastPage).
An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php:719
Stack trace:
#0 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(658): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(638): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(529): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(268): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /srv/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(317): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /srv/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(204): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /srv/www/nextcloud/apps/fulltextsearch/lib/Command/Index.php(410): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'anni', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /srv/www/nextcloud/apps/fulltextsearch/lib/Command/Index.php(273): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Command/Command.php(258): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /srv/www/nextcloud/core/Command/Base.php(169): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Application.php(920): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Application.php(266): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Application.php(142): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /srv/www/nextcloud/lib/private/Console/Application.php(215): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /srv/www/nextcloud/console.php(100): OC\Console\Application->run()
#15 /srv/www/nextcloud/occ(11): require_once('/srv/www/nextcl...')
I understand that the file might be broken, but any idea why the interface crashes?
I got the same error as @TheR00st3r. Disabling PDF in the Plugin-Menu fixed it for me. Also tried to install ghostscript, but still got the same error.
I'm using the nextcloud-20 Docker image (don't know the exact version, sorry) and the following helped, at least until the container is restarted:
docker exec -i -t nextcloud bash
# root inside container
apt-get update
apt install ghostscript
apt install imagemagick
apt install nano
# edit <policy domain="coder" rights="read | write" pattern="PDF" />
nano /etc/ImageMagick-6/policy.xml
exit
docker exec -u 33 nextcloud php occ fulltextsearch:stop
docker exec -u 33 nextcloud php occ fulltextsearch:index
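The manual nano step above could also be scripted, which may help when the change has to be repeated after the container is recreated. This is a sketch under assumptions: the helper name `allow_pdf_coder` is made up, and it expects the PDF rule to already exist on one line in the form `<policy domain="coder" rights="..." pattern="PDF" />` (if it does not, the file is left unchanged). It keeps a `.bak` copy of the original. Note that it re-enables PDF handling that the packaged policy disabled, so the security concern raised earlier in the thread still applies.

```shell
#!/bin/sh
# Hypothetical helper: set the rights of an existing PDF coder rule
# in an ImageMagick policy file to "read | write".
allow_pdf_coder() {
    policy_file=$1
    # Keep a backup, then rewrite the file from that backup
    cp "$policy_file" "$policy_file.bak"
    sed 's/\(<policy domain="coder" rights="\)[^"]*\(" pattern="PDF"\)/\1read | write\2/' \
        "$policy_file.bak" > "$policy_file"
}
```

Inside the container it might be called as `allow_pdf_coder /etc/ImageMagick-6/policy.xml`, followed by the two `occ` commands shown above.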
I can reproduce this issue very easily with a specific file of mine, and the trick with editing ImageMagick's policy.xml doesn't seem to help at all.
When I reproduce this, I see these errors:
**** Warning: File has some garbage before %PDF- .
**** Error: File did not complete the page properly and may be damaged.
Output may be incorrect.
**** Error reading a content stream. The page may be incomplete.
Output may be incorrect.
**** Error: File did not complete the page properly and may be damaged.
Output may be incorrect.
Error: /rangecheck in /--pdfshowpage_finish--
Operand stack:
--dict:7/15(L)-- --nostringval-- 9 32 --nostringval-- -1 --nostringval--
Execution stack:
%interp_exit .runexec2 --nostringval-- pdfshowpage_finish --nostringval-- 2 %stopped_push --nostringval-- pdfshowpage_finish pdfshowpage_finish false 1 %stopped_push 1974 1 3 %oparray_pop 1973 1 3 %oparray_pop 1961 1 3 %oparray_pop 1962 1 3 %oparray_pop pdfshowpage_finish pdfshowpage_finish 3 1 6 pdfshowpage_finish %for_pos_int_continue 1965 1 7 %oparray_pop pdfshowpage_finish pdfshowpage_finish
Dictionary stack:
--dict:744/1123(ro)(G)-- --dict:1/20(G)-- --dict:86/200(L)-- --dict:86/200(L)-- --dict:135/256(ro)(G)-- --dict:320/325(ro)(G)-- --dict:35/64(L)-- --dict:6/9(L)-- --dict:6/20(L)-- --dict:1/1(ro)(G)--
Current allocation mode is local
Last OS error: No such file or directory
GPL Ghostscript 9.52: Unrecoverable error, exit code 1
It looks like the $document reference variable is messed with as soon as we catch an exception in TesseractService.php's extractContentUsingTesseractOCR. It becomes a string of value *** sensitive parameter replaced ***, as @ufobat mentioned here.
I'm no PHP expert by any stretch of the imagination, but this looks like a memory corruption bug to me. I wonder if some kind of memory corruption occurs as soon as the Pdf class constructor calls Imagick's pingImage method.
By the way, I'm also using Nextcloud 20.0.4 along with:
- Full text search 20.0.0
- Full text search - Elasticsearch Platform 20.0.0
- Full text search - Files 20.0.0
- Full text search - Files - Tesseract OCR 20.0.1
Here's a very rough patch that allowed me to get past that unhandled exception: files_fulltextsearch-workaround-document-corruption-caused-by-tesseract.diff.zip
Dang, I just ran into the same error and wrote down an issue at the wrong repository.
This error occurs if the file is corrupt, so a simple fix would be deleting that file. Another workaround would be to simply skip that file and inform the user that there was an issue indexing it due to possible corruption.
@danielsteiner Fair enough, maybe the file is corrupt. But in my case, I double checked; it was a very simple utility bill in PDF format; it's perfectly viewable using a PDF viewer, and it's of value to me. So, as a user, I don't view "deleting that file" as an acceptable "fix"; I actually need to keep this document.
I think it's much preferable to find a way for files_fulltextsearch and its tesseract counterpart to handle such corrupted files gracefully.
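One of the Ghostscript logs above warns about garbage before %PDF-. A quick pre-check for that single symptom could look like the sketch below. The helper name `starts_with_pdf_header` is hypothetical, and the check is deliberately narrow: it only looks at the first five bytes, so it says nothing about damaged XREF tables or content streams like those in the other logs.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file begins with the
# five-byte PDF header "%PDF-".
starts_with_pdf_header() {
    # Read exactly the first 5 bytes; dd's own status output is discarded
    [ "$(dd if="$1" bs=5 count=1 2>/dev/null)" = "%PDF-" ]
}
```

A loop such as `for f in *.pdf; do starts_with_pdf_header "$f" || echo "$f"; done` would then list files that may deserve a closer look before indexing.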
Yeah, I noticed that at some point earlier today, after indexing and OCRing 11k files... I will try to fix that issue and submit a pull request. Shouldn't be too hard.
@danielsteiner Thank you so much for looking into this! Much appreciated!
I did this in FilesService.php on my local install and it seems to work now.
if (gettype($document) != 'string') {
    if ($document->getContent() === null) {
        $document->getIndex()
                 ->unsetStatus(IIndex::INDEX_CONTENT);
    }
    $this->updateCommentsFromFile($document);
}
I check that $document is not of type string before calling $document->getContent().
What does the official solution to this problem look like? So far I have worked around it with the custom code modification described by the colleague above, but will this be fixed officially in an update? Thanks
I'm encountering the same issue as initially described in this post. Neither the change of policy in the policy.xml file nor the installation of ghostscript seems to fix the problem. Are there any other solutions?