files_fulltextsearch

Index files other than text/document files

Open pacproduct opened this issue 6 years ago • 2 comments

Hi.

I'm far from grasping the complexity of ES and NC's fulltextsearch suite, but I thought that the Ingest Attachment Processor Plugin that we add to ElasticSearch aims at indexing virtually any known type of file, thanks to Apache Tika, which knows how to parse hundreds of file types.

Despite that, it seems to me like files_fulltextsearch provides ES with the content of files only when they match the following types: Text, Office, PDF.

And indeed, I've installed and configured files_fulltextsearch on a local NextCloud instance for testing purposes, and I don't seem to be able to search within the content of ZIP files, image files, etc., even though Tika knows these file types.

Isn't it possible to just send all file contents to ES so it indexes as many file types as it can?

Thx.

pacproduct avatar Nov 13 '18 16:11 pacproduct

Does anyone know where the logic is that determines what files are indexed? I just took a quick look and couldn't find it. I noticed that markdown files aren't indexed, and I figured that one would be an easy fix (just treat it like a .txt), but I couldn't find the place where it reads file extensions.

tucker-m avatar Jan 13 '22 07:01 tucker-m

I am not sure if this is the right approach, but here is what I found.

  • lib/Service/FilesService.php
        /**
         * @param string $mimeType
         * @param string $extension
         * @param string $parsed
         *
         * @throws KnownFileMimeTypeException
         */
        private function parseMimeTypeText(string $mimeType, string $extension, string &$parsed) {

                if (substr($mimeType, 0, 5) === 'text/') {
                        $parsed = self::MIMETYPE_TEXT;
                        throw new KnownFileMimeTypeException();
                }

                // 20220219 Parse XML files as TEXT files
                if (substr($mimeType, 0, 15) === 'application/xml') {
                        $parsed = self::MIMETYPE_TEXT;
                        throw new KnownFileMimeTypeException();
                }

                // 20220219 Parse .drawio file
                if ($extension === 'drawio') {
                        $parsed = self::MIMETYPE_TEXT;
                        throw new KnownFileMimeTypeException();
                }

This way, application/xml and .drawio files are included for indexing.
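For anyone who wants to experiment with the rule outside the app, here is a minimal standalone sketch of the same kind of check. It is not the actual `FilesService` code: the function name `classifyAsText`, the `MIMETYPE_TEXT` value, and the `md` entry in the extension list are all assumptions made for illustration, and it returns a value instead of throwing `KnownFileMimeTypeException`.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for FilesService::MIMETYPE_TEXT; the real value may differ.
const MIMETYPE_TEXT = 'text';

/**
 * Returns MIMETYPE_TEXT when a file should be treated as plain text for
 * indexing, or null when none of the rules below match.
 */
function classifyAsText(string $mimeType, string $extension): ?string
{
    // Any text/* type (text/plain, text/html, ...).
    if (strpos($mimeType, 'text/') === 0) {
        return MIMETYPE_TEXT;
    }

    // XML files, matched by prefix as in the snippet above.
    if (strpos($mimeType, 'application/xml') === 0) {
        return MIMETYPE_TEXT;
    }

    // Extensions handled as text regardless of the reported MIME type.
    // 'md' is only an illustrative addition, not something the app is known to do.
    $textExtensions = ['drawio', 'md'];
    if (in_array(strtolower($extension), $textExtensions, true)) {
        return MIMETYPE_TEXT;
    }

    return null;
}

var_dump(classifyAsText('application/xml', 'xml'));
var_dump(classifyAsText('application/zip', 'zip'));
```

Keeping the extensions in one array makes it easier to add another type later without copying a whole `if` block.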

.drawio files need an extra extraction step, because the diagram content inside them is deflate-compressed XML.
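A rough sketch of that extra step, assuming the diagram payload is encoded as URL-encoding, then raw deflate, then base64 (which is how I understand the compressed draw.io format; files saved uncompressed would not need this). The function name `decodeDrawioDiagram` is hypothetical, pulling the payload out of the `<diagram>` element is left out, and the zlib extension is required for `gzinflate`.

```php
<?php
declare(strict_types=1);

/**
 * Decodes one compressed diagram payload back into XML.
 * Assumed encoding chain: URL-encode -> raw deflate -> base64.
 * Returns null when the payload cannot be decoded.
 */
function decodeDrawioDiagram(string $payload): ?string
{
    // Step 1: undo base64 (strict mode rejects invalid characters).
    $binary = base64_decode(trim($payload), true);
    if ($binary === false) {
        return null;
    }

    // Step 2: undo raw deflate; gzinflate() warns on bad input, so silence it.
    $inflated = @gzinflate($binary);
    if ($inflated === false) {
        return null;
    }

    // Step 3: undo URL-encoding to get the XML text that can be indexed.
    return rawurldecode($inflated);
}

// Round trip with a made-up payload, built by applying the chain in reverse.
$xml = '<mxGraphModel><root><mxCell id="0"/></root></mxGraphModel>';
$payload = base64_encode(gzdeflate(rawurlencode($xml)));
var_dump(decodeDrawioDiagram($payload) === $xml);
```

The decoded string is ordinary XML, so it can go through the same text path as the `application/xml` files.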

Anyway, I have managed to get .xml and .drawio files indexed. If anyone is interested, I can push my branch.

My blog article on the issue is here

masahirominami avatar Feb 25 '22 15:02 masahirominami