Add support for other text formats in the file content filter
Is your feature request related to a problem? Please describe.
The file content filter currently supports the following file formats: .md, .txt, .log, .pdf, and .docx. Of these, only PDF and DOCX files require specialized extraction logic. The other three are treated as plain text: the filter simply reads the file's content.
The problem is that a text file with any other extension cannot be processed, because its extension is not in the list of supported formats here: https://github.com/tfeldmann/organize/blob/ac520341a639a0bed6c55fd0c13604fcf927b666/organize/filters/filecontent.py#L82-L88
Text files come with many different extensions, some of them unusual, and sometimes with no extension at all. There is no reason to exclude those files from processing.
For example, I tried to search an XML file for a specific pattern, which is currently not possible.
Describe the solution you'd like
We could use a specialized extractor when one is registered for the given file format (currently PDF and DOCX) and fall back to the simple text extractor (extract_txt) for all other files:
https://github.com/tfeldmann/organize/blob/ac520341a639a0bed6c55fd0c13604fcf927b666/organize/filters/filecontent.py#L37-L38
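To make the proposal concrete, here is a minimal, self-contained sketch of the fallback idea. It is not the actual code from filecontent.py: the names SPECIALIZED_EXTRACTORS, extract_plain_text, and extract_content are hypothetical and chosen only for illustration, and reading with UTF-8 and errors="replace" is an assumption about how undecodable bytes should be handled.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: lower-case file suffix -> specialized extractor.
# In the real filter this would hold the PDF and DOCX extractors.
SPECIALIZED_EXTRACTORS: Dict[str, Callable[[Path], str]] = {}


def extract_plain_text(path: Path) -> str:
    """Fallback: read the file as text (stand-in for extract_txt)."""
    # Assumption: undecodable bytes are replaced instead of raising.
    return path.read_text(encoding="utf-8", errors="replace")


def extract_content(path: Path) -> str:
    """Use a registered extractor if there is one, else read as plain text."""
    extractor = SPECIALIZED_EXTRACTORS.get(path.suffix.lower(), extract_plain_text)
    return extractor(path)
```

With this shape, a file such as data.xml, or a file without any extension, has no registered extractor and is therefore read as plain text, while registered suffixes keep their specialized handling.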
I can provide a PR if this solution is accepted.