files_fulltextsearch
Index files other than text/document files
Hi.
I'm far from grasping the complexity of ES and Nextcloud's fulltextsearch suite, but I thought that the Ingest Attachment Processor Plugin that we add to Elasticsearch aims at indexing virtually any known type of file, thanks to Apache Tika, which knows how to parse hundreds of file types.
Despite that, it seems to me that files_fulltextsearch provides ES with the content of files only when they match one of the following types: Text, Office, PDF.
And indeed, I've installed and configured files_fulltextsearch on a local Nextcloud instance for testing purposes, and I don't seem to be able to search within the content of ZIP files, image files, etc., although Tika knows these file types.
Isn't it possible to just send all file contents to ES so it indexes as many file types as it can?
Thx.
Does anyone know where the logic is that determines what files are indexed? I just took a quick look and couldn't find it. I noticed that markdown files aren't indexed, and I figured that one would be an easy fix (just treat it like a .txt), but I couldn't find the place where it reads file extensions.
I am not sure if this is the right approach, but here is what I found.
- lib/Service/FilesService.php
/**
 * @param string $mimeType
 * @param string $extension
 * @param string $parsed
 *
 * @throws KnownFileMimeTypeException
 */
private function parseMimeTypeText(string $mimeType, string $extension, string &$parsed) {
    // Any text/* MIME type is indexed as plain text
    if (substr($mimeType, 0, 5) === 'text/') {
        $parsed = self::MIMETYPE_TEXT;
        throw new KnownFileMimeTypeException();
    }

    // 20220219 Parse XML files as TEXT files
    if (substr($mimeType, 0, 15) === 'application/xml') {
        $parsed = self::MIMETYPE_TEXT;
        throw new KnownFileMimeTypeException();
    }

    // 20220219 Parse .drawio file
    if ($extension === 'drawio') {
        $parsed = self::MIMETYPE_TEXT;
        throw new KnownFileMimeTypeException();
    }

    // (excerpt ends here; the rest of the method is not shown)
}
This way, application/xml and .drawio files are included for indexing.
.drawio files need an extra extraction step, because their diagram data is deflated XML.
Anyway, I have managed to get .xml and .drawio files indexed. If anyone is interested, I can push my branch.
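To give an idea of what that extra extraction step for .drawio could look like, here is a minimal sketch. It is not the code from my branch and not part of files_fulltextsearch. It assumes the usual compressed draw.io layout, where each <diagram> element holds Base64 text that is raw-deflated, URL-encoded XML. The function names decodeDrawioDiagram and extractDrawioXml are made up for illustration, and the sketch needs the zlib and SimpleXML extensions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of files_fulltextsearch).
 * Decodes one compressed <diagram> payload: Base64 -> raw inflate -> URL-decode.
 * Returns null when the payload does not look like compressed diagram data.
 */
function decodeDrawioDiagram(string $payload): ?string {
    $binary = base64_decode(trim($payload), true);
    if ($binary === false) {
        return null;
    }

    // gzinflate() handles raw deflate data (no zlib or gzip header)
    $inflated = @gzinflate($binary);
    if ($inflated === false) {
        return null;
    }

    return rawurldecode($inflated);
}

/**
 * Hypothetical helper: returns the XML of every page in a .drawio file.
 * Pages stored uncompressed (child elements instead of text) are returned as-is.
 *
 * @return string[]
 */
function extractDrawioXml(string $fileContent): array {
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($fileContent);
    libxml_use_internal_errors($previous);
    if ($doc === false) {
        return [];
    }

    $pages = [];
    foreach ($doc->diagram as $diagram) {
        if (count($diagram->children()) > 0) {
            // Uncompressed page: the XML is already there as child elements
            $pages[] = (string)$diagram->children()[0]->asXML();
            continue;
        }
        $xml = decodeDrawioDiagram((string)$diagram);
        if ($xml !== null) {
            $pages[] = $xml;
        }
    }

    return $pages;
}
```

The returned XML could then be handed to the indexer as text, the same way the application/xml case is treated above. Stripping the tags first, so that only the cell labels are indexed, would probably give cleaner search results, but I have left that out to keep the sketch short.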
My blog article on the issue is here