Add support for other text formats in the file content filter

Open dpomykala opened this issue 3 months ago • 0 comments

Is your feature request related to a problem? Please describe.

The file content filter currently supports the following file formats: .md, .txt, .log, .pdf, and .docx. Of these, only PDF and DOCX files require specialized extraction logic. The other three are treated as plain text (the content of the file is simply read).

The problem is that a text file with any other extension cannot be processed, because its extension is not listed here: https://github.com/tfeldmann/organize/blob/ac520341a639a0bed6c55fd0c13604fcf927b666/organize/filters/filecontent.py#L82-L88

Many text files have unusual extensions, and some have no extension at all. There is no reason to exclude those files from processing.

For example, I tried to process an XML file by searching it for a specific pattern, but this is currently not possible.

Describe the solution you'd like

We could use a specialized extractor if one is registered for a given file format (currently PDF and DOCX) and use the simple text extractor (extract_txt) as a fallback for all other files: https://github.com/tfeldmann/organize/blob/ac520341a639a0bed6c55fd0c13604fcf927b666/organize/filters/filecontent.py#L37-L38
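To make the proposal concrete, here is a minimal sketch of the fallback lookup. It is not the actual code from filecontent.py: the names SPECIALIZED_EXTRACTORS and extract_text are hypothetical, the PDF and DOCX extractors are placeholders standing in for the existing ones, and the UTF-8 decoding with replacement of invalid bytes is my assumption about how arbitrary files might be read safely.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_txt(path: Path) -> str:
    # Plain-text extractor: read the file as UTF-8.
    # Assumption: undecodable bytes are replaced instead of raising.
    return path.read_text(encoding="utf-8", errors="replace")


def extract_pdf(path: Path) -> str:
    # Placeholder for the existing specialized PDF extractor.
    raise NotImplementedError("stand-in for the real PDF extractor")


def extract_docx(path: Path) -> str:
    # Placeholder for the existing specialized DOCX extractor.
    raise NotImplementedError("stand-in for the real DOCX extractor")


# Hypothetical registry: only formats that need special handling are listed.
SPECIALIZED_EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}


def extract_text(path: Path) -> str:
    # Use a specialized extractor if one is registered for the suffix,
    # otherwise fall back to the plain-text extractor. A file without
    # an extension has suffix "" and therefore also takes the fallback.
    extractor = SPECIALIZED_EXTRACTORS.get(path.suffix.lower(), extract_txt)
    return extractor(path)
```

With this lookup, .md, .txt, and .log no longer need their own entries, and an .xml file or a file with no extension is read as plain text. Binary files would also reach the fallback, so the real implementation may want to decide how to treat them.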

I can provide a PR if this solution is accepted.

Sep 14 '25 15:09 dpomykala