files_fulltextsearch

Exception FilesService calling `$document->getContent()` while indexing a Document

Open TheR00st3r opened this issue 4 years ago • 35 comments

I have a new Debian 10 installation with Nextcloud 20.0.3.

Running occ fulltextsearch:index leads to an unhandled exception:

An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php:719
Stack trace:
#0 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(658): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(638): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(529): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /var/www/html/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(268): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /var/www/html/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(317): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /var/www/html/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(204): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /var/www/html/nextcloud/apps/fulltextsearch/lib/Command/Index.php(410): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'Andy', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /var/www/html/nextcloud/apps/fulltextsearch/lib/Command/Index.php(273): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /var/www/html/nextcloud/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /var/www/html/nextcloud/core/Command/Base.php(169): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /var/www/html/nextcloud/3rdparty/symfony/console/Application.php(1000): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/nextcloud/3rdparty/symfony/console/Application.php(271): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/nextcloud/3rdparty/symfony/console/Application.php(147): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/nextcloud/lib/private/Console/Application.php(215): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/nextcloud/console.php(100): OC\Console\Application->run()
#15 /var/www/html/nextcloud/occ(11): require_once('/var/www/html/n...')

Is there any way to fix this error?

TheR00st3r avatar Dec 14 '20 21:12 TheR00st3r

I see this too, on an existing, working 19.0.6 install that was updated to 20.0.3.

epvuc avatar Dec 19 '20 18:12 epvuc

I tried to debug it a bit, though I have no clue about PHP. It seems that the variable $document loses the object it was containing and gets replaced with the string *** sensitive parameter replaced ***. I think the only place where this can happen is somewhere here, because the document is passed by reference as &$document.

ufobat avatar Dec 19 '20 18:12 ufobat

After further investigation, I think an exception is thrown in this line, where $e->getMessage() shows:

FailedToExecuteCommand `'gs' -sstdout=%stderr -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 '-sDEVICE=pngalpha' -dTextAlphaBits=4 -dGraphicsAlphaBits=4 '-r72x72'  '-sOutputFile=/tmp/magick-1272R2-pXVaqdBDj%d' '-f/tmp/magick-1272u4NCaxRdtnmc' '-f/tmp/magick-12721ZKLppFvcg54'' (1) @ error/pdf.c/InvokePDFDelegate/291

The code in the catch block is passed $document, and my guess is that this "replaces" the original $document with the string *** sensitive parameter replaced ***.

ufobat avatar Dec 19 '20 19:12 ufobat

This seems to work (for me) if you install ghostscript. Is this a new dependency?

apt install ghostscript :D

ufobat avatar Dec 19 '20 20:12 ufobat

I have ghostscript:9.27~dfsg-2+deb10u4 installed in the nextcloud image and /usr/bin/gs works and is in the path, but it continues to fail in the same way.

However, unchecking "Enable OCR" in Settings -> Files - Tesseract OCR does allow indexing to proceed without errors. (though obviously also without OCR)

Is there perhaps another new dependency for Tesseract, other than gs (which is installed)?

epvuc avatar Dec 20 '20 04:12 epvuc

@epvuc, you could add

file_put_contents("/tmp/mydebug.log", "caught exception: " . $e->getMessage() . "\n" , FILE_APPEND);

in the catch block in the file custom_apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php (around line 251). Maybe that helps you figure it out yourself.

ufobat avatar Dec 20 '20 07:12 ufobat

Can you confirm that you still have this issue with the latest versions of files_fulltextsearch and files_fulltextsearch_tesseract?

ArtificialOwl avatar Dec 20 '20 11:12 ArtificialOwl

@daita Do you mean without having ghostscript installed? Or are you asking @epvuc for a confirmation?

ufobat avatar Dec 20 '20 14:12 ufobat

I have files_fulltextsearch=20.0.0 and files_fulltextsearch_tesseract=20.0.1 which seem to be the latest versions and are the same as what's in git.

epvuc avatar Dec 20 '20 17:12 epvuc

@ufobat I have already installed Ghostscript. This didn't change anything. I added the additional log line and it gave me this info:

caught exception: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408

@daita I am on Nextcloud 20.0.4 with the following FTS versions, but the problem still exists:

  • Full text search 20.0.0
  • Full text search - Elasticsearch Platform 20.0.0
  • Full text search - Files 20.0.0
  • Full text search - Files - Tesseract OCR 20.0.1

TheR00st3r avatar Dec 20 '20 18:12 TheR00st3r

@ufobat With the additional logging I found a solution that worked for me. I had to add an additional policy for ImageMagick as mentioned here: https://stackoverflow.com/a/53180170/1254045 In my case the policy file was /etc/ImageMagick-7/policy.xml, where I added <policy domain="coder" rights="read | write" pattern="PDF" />.
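
For anyone else checking their setup, here is a minimal shell sketch for reading the current PDF rule out of a policy file. The function name pdf_policy_rights is made up for illustration. It assumes the rule sits on a single line in the same form as the line quoted above, and it does not parse XML, so a commented-out rule would match as well.

```shell
#!/bin/sh
# Hypothetical helper: print the rights value of the PDF coder rule
# found in an ImageMagick policy file. Prints nothing if no such rule exists.
pdf_policy_rights() {
    policy_file="$1"
    # Keep only lines that declare a coder policy for the PDF pattern,
    # then extract the text between rights=" and the next double quote.
    grep 'domain="coder"' "$policy_file" \
        | grep 'pattern="PDF"' \
        | sed -n 's/.*rights="\([^"]*\)".*/\1/p'
}
```

Calling pdf_policy_rights /etc/ImageMagick-7/policy.xml (adjust the path to your install) and getting none back would be consistent with the "not allowed by the security policy" message.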

TheR00st3r avatar Dec 20 '20 18:12 TheR00st3r

@ufobat With the additional logging I found a solution that worked for me. I had to add an additional policy for ImageMagick as mentioned here: https://stackoverflow.com/a/53180170/1254045 In my case the policy file was /etc/ImageMagick-7/policy.xml, where I added <policy domain="coder" rights="read | write" pattern="PDF" />.

This ImageMagick policy change allows indexing to work again for me as well.

I think it will be important to know what security flaw in (presumably) ghostscript led to this being disallowed in ImageMagick's policy and in what gs version it was fixed, before making this change, though.

epvuc avatar Dec 20 '20 19:12 epvuc

Uh, I have had to keep this policy active ever since I started to work with the Tesseract full text search.

ufobat avatar Dec 20 '20 20:12 ufobat

People deploying Nextcloud via the official docker container on dockerhub will get a fresh copy of this file with every update inherited from the imagemagick package used to build the container.

I wonder if those experiencing this problem are all using the official Dockerhub container. If so, it may be that the container builder was updated to use a new version of the ImageMagick package which included the policy change we've observed. In this case the maintainer of the nextcloud docker image should be notified.

-- edit -- Actually no, the docker image template doesn't install ImageMagick at all. It looks like that's up to the end user, so this policy change would be up to the end user as well, unless someone wanted to add it to the official container.

epvuc avatar Dec 20 '20 20:12 epvuc

So it looks like it is working with ghostscript and imagick with the right policies?

ArtificialOwl avatar Dec 21 '20 09:12 ArtificialOwl

So it looks like it is working with ghostscript and imagick with the right policies?

For me it is, yes. Only change might be to have the indexing code gracefully handle the situation where ImageMagick returns "caught exception: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408" and perhaps surface a useful error, as this will be up to the packager/user to handle.

Handling this policy refusal is also important because we could argue it's really not safe to feed arbitrary PDF files to ghostscript via ImageMagick at all, and a user might choose not to allow this. In this case we would want fulltextsearch-files-tesseract to still work for plain image formats and not break indexing entirely.

epvuc avatar Dec 21 '20 17:12 epvuc

So it looks like it is working with ghostscript and imagick with the right policies?

In my case yes.

For me it is, yes. Only change might be to have the indexing code gracefully handle the situation where ImageMagick returns "caught exception: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408" and perhaps surface a useful error, as this will be up to the packager/user to handle.

That would be very helpful. Either a way to fix it or a short description of how to resolve the problem would do.

TheR00st3r avatar Dec 21 '20 18:12 TheR00st3r

Hi all, I'm fighting with the same error, but when I added the debug line from @ufobat, I get: caught exception: Failed to read the file. Here is the full log as well:

   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
   **** Error:  Trailer dictionary not found.
                Output may be incorrect.
   No pages will be processed (FirstPage > LastPage).
An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php:719
Stack trace:
#0 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(658): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(638): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(529): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /srv/www/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(268): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /srv/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(317): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /srv/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(204): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /srv/www/nextcloud/apps/fulltextsearch/lib/Command/Index.php(410): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'anni', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /srv/www/nextcloud/apps/fulltextsearch/lib/Command/Index.php(273): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Command/Command.php(258): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /srv/www/nextcloud/core/Command/Base.php(169): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Application.php(920): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Application.php(266): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /srv/www/nextcloud/apps/mail/vendor/symfony/console/Application.php(142): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /srv/www/nextcloud/lib/private/Console/Application.php(215): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /srv/www/nextcloud/console.php(100): OC\Console\Application->run()
#15 /srv/www/nextcloud/occ(11): require_once('/srv/www/nextcl...')

I understand that the file might be broken, but any idea why the interface crashes?

tacruc avatar Dec 22 '20 08:12 tacruc

I got the same error as @TheR00st3r. Disabling PDF in the Plugin-Menu fixed it for me. Also tried to install ghostscript, but still got the same error.

R7e98kva avatar Dec 30 '20 17:12 R7e98kva

I'm using the nextcloud-20 docker image (I don't know the exact version, sorry) and the following helped, at least until the container is restarted:

docker exec -i -t nextcloud bash
# root inside container
apt-get update
apt install ghostscript
apt install imagemagick
apt install nano
# edit the PDF rule so that it reads: <policy domain="coder" rights="read | write" pattern="PDF" />
nano /etc/ImageMagick-6/policy.xml
exit

docker exec -u 33 nextcloud php occ fulltextsearch:stop
docker exec -u 33 nextcloud php occ fulltextsearch:index
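
The manual nano step could probably be scripted so that it is easy to repeat after a container restart. The sketch below is only an illustration: allow_pdf_coder is a made-up helper name, it assumes the PDF rule already exists on a single line in the policy file, and it only rewrites that rule rather than adding a missing one. As noted earlier in this thread, relaxing the rule means PDFs are handed to Ghostscript via ImageMagick, so this is a trade-off and not a general recommendation.

```shell
#!/bin/sh
# Hypothetical helper: set the rights of an existing PDF coder rule
# to "read | write" in the given ImageMagick policy file.
allow_pdf_coder() {
    policy_file="$1"
    tmp_file="$(mktemp)" || return 1
    # Only touch lines that declare a coder policy for the PDF pattern;
    # every other line is copied through unchanged.
    sed '/domain="coder"/{/pattern="PDF"/s/rights="[^"]*"/rights="read | write"/;}' \
        "$policy_file" > "$tmp_file" \
        && cat "$tmp_file" > "$policy_file"   # keeps the original owner and mode
    status=$?
    rm -f "$tmp_file"
    return "$status"
}
```

Run as root inside the container, this would be called as allow_pdf_coder /etc/ImageMagick-6/policy.xml, followed by the two occ commands above.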

d-rk avatar Dec 31 '20 10:12 d-rk

I can reproduce this issue very easily with a specific file of mine, and the trick with editing ImageMagick's policy.xml doesn't seem to help at all.

When I reproduce this, I see these errors:

   **** Warning: File has some garbage before %PDF- .
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.
Error: /rangecheck in /--pdfshowpage_finish--
Operand stack:
   --dict:7/15(L)--   --nostringval--   9   32   --nostringval--   -1   --nostringval--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   pdfshowpage_finish   --nostringval--   2   %stopped_push   --nostringval--   pdfshowpage_finish   pdfshowpage_finish   false   1   %stopped_push   1974   1   3   %oparray_pop   1973   1   3   %oparray_pop   1961   1   3   %oparray_pop   1962   1   3   %oparray_pop   pdfshowpage_finish   pdfshowpage_finish   3   1   6   pdfshowpage_finish   %for_pos_int_continue   1965   1   7   %oparray_pop   pdfshowpage_finish   pdfshowpage_finish
Dictionary stack:
   --dict:744/1123(ro)(G)--   --dict:1/20(G)--   --dict:86/200(L)--   --dict:86/200(L)--   --dict:135/256(ro)(G)--   --dict:320/325(ro)(G)--   --dict:35/64(L)--   --dict:6/9(L)--   --dict:6/20(L)--   --dict:1/1(ro)(G)--
Current allocation mode is local
Last OS error: No such file or directory
GPL Ghostscript 9.52: Unrecoverable error, exit code 1

It looks like the $document reference variable is messed with as soon as we catch an exception in TesseractService.php's extractContentUsingTesseractOCR. It becomes a string of value *** sensitive parameter replaced ***, as @ufobat mentioned here.

JoeKun avatar Jan 01 '21 00:01 JoeKun

I'm no PHP expert by any stretch of imagination, but this looks like a memory corruption bug to me. I wonder if some kind of memory corruption occurs as soon as the Pdf class constructor calls Imagick's pingImage method.

By the way, I'm also using Nextcloud 20.0.4 along with:

  • Full text search 20.0.0
  • Full text search - Elasticsearch Platform 20.0.0
  • Full text search - Files 20.0.0
  • Full text search - Files - Tesseract OCR 20.0.1

JoeKun avatar Jan 01 '21 00:01 JoeKun

Here's a very rough patch that allowed me to get past that unhandled exception: files_fulltextsearch-workaround-document-corruption-caused-by-tesseract.diff.zip

JoeKun avatar Jan 01 '21 01:01 JoeKun

Dang, I just ran into the same error and wrote down an issue at the wrong repository.

This error occurs if the file is corrupt, so a simple fix would be to delete that file. Another workaround would be to simply skip that file and inform the user that there has been an issue with indexing it due to possible corruption.
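
To find candidates before touching anything, a rough check like the one below might help. It is only a sketch: has_pdf_header is a made-up name, and it tests just one symptom from the logs above (content in front of the %PDF- marker). A file can pass this check and still have a damaged XREF table or trailer, so treat it as a first filter only.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file starts with the
# five bytes "%PDF-". Unreadable or missing files fail the check.
has_pdf_header() {
    header="$(head -c 5 "$1" 2>/dev/null)"
    [ "$header" = "%PDF-" ]
}
```

One way to use it is to loop over the PDFs of a user (the data path depends on your install) and print every file for which has_pdf_header fails, then look at those files by hand instead of deleting them outright.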

danielsteiner avatar Jan 07 '21 05:01 danielsteiner

@danielsteiner Fair enough, maybe the file is corrupt. But in my case, I double checked; it was a very simple utility bill in PDF format; it's perfectly viewable using a PDF viewer, and it's of value to me. So, as a user, I don't view "deleting that file" as an acceptable "fix"; I actually need to keep this document.

I think it's much preferable to find a way for files_fulltextsearch and its tesseract counterpart to handle such corrupted files gracefully.

JoeKun avatar Jan 07 '21 17:01 JoeKun

Yeah, I noticed that at some point earlier today, after indexing and OCRing 11k files... I will try to fix that issue and submit a pull request. Shouldn't be too hard.

danielsteiner avatar Jan 07 '21 17:01 danielsteiner

@danielsteiner Thank you so much for looking into this! Much appreciated!

JoeKun avatar Jan 07 '21 17:01 JoeKun

I did this in FilesService.php on my local install, and it seems to work now:

if (gettype($document) != 'string') {
    if ($document->getContent() === null) {
        $document->getIndex()
                 ->unsetStatus(IIndex::INDEX_CONTENT);
    }
    $this->updateCommentsFromFile($document);
}

I check that $document is not of type string before stepping forward to $document->getContent().

Telmur avatar May 06 '21 15:05 Telmur

What will the proper solution to this problem look like? So far, I have worked around it with the custom code modification described by the colleague above, but will this be fixed officially in an update? Thanks

tomasmark79 avatar May 18 '21 07:05 tomasmark79

I'm encountering the same issue as initially described in this post. Neither the change of policy in the policy.xml file nor the installation of ghostscript seems to fix the problem. Are there any other solutions?

Ourithi avatar May 26 '21 07:05 Ourithi