Regex for exclusions
sist2 version: 3.1.4
Platform (Linux or Docker, x86-64 or arm64): Docker
Elasticsearch version: 7.7.19
It would be great to get some common regexes that might be used for exclusions.
I'm using this one in a couple of places where I just want to index PDFs, ebooks, and other text-based files:
.*\.(jpg|jpeg|tiff|JPEG|png|PNG|webm|webp|sql)
And another to cleanly index a calibre directory:
(.*\.(jpg|jpeg|tiff|JPEG|png|opf))
And one to exclude a folder that has pdf.js inside it that I don't need indexed.
|\/pdf\/.*
I am still looking for a regex to exclude dotfiles. Nothing I've tried so far has worked.
I've created a series of components that target different file types. You can join these statements together with an OR (|) to create the exact exclusion list you want for a given job.
Compressed files: (?i:^.*\.(zip|rar|7z|tar|gz|gzip)$)
Videos: (?i:^.*\.(mp4|avi|wmv|mov|flv|mkv|webm|vob|ogv|m4v|3gp|3g2|mpeg|mpg|m2v|svi|3gpp|3gpp2|mxf|roq|nsv|f4v|f4p|f4a|f4b)$)
Audio: (?i:^.*\.(mp3|wav|wma|aac|flac|ogg|m4a|aiff|alac|amr|ape|au|mpc|tta|wv|opus)$)
Images: (?i:^.*\.(jpg|jpeg|png|gif|bmp|tiff|psd|raw|cr2|nef|orf|sr2)$)
Presentations: (?i:^.*\.(ppt|pptx|pps|ppsx|odp|fodp|otp|key)$)
Spreadsheets: (?i:^.*\.(xls|xlsx|csv|ods|fods|ots|gnumeric|numbers)$)
Documents: (?i:^.*\.(doc|docx|pdf|txt|rtf|odt|wps|wpd|pages)$)
Web files: (?i:^.*\.(js|html|css)$)
Dotfiles: (^/?(?:\w+/)*(\.\w+))
An example: exclude dotfiles, images, compressed files, and web files: (?i:^.*\.(zip|rar|7z|tar|gz|gzip)$)|(^/?(?:\w+/)*(\.\w+))|(?i:^.*\.(jpg|jpeg|png|gif|bmp|tiff|psd|raw|cr2|nef|orf|sr2)$)|(?i:^.*\.(js|html|css)$)
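Before pasting a combined pattern into a scan job, it can help to check it against a few sample paths. The following is a minimal sketch in Python, assuming the indexer applies the pattern with search-style matching and that Python's `re` behaves closely enough to the indexer's regex engine for these simple patterns; neither assumption is confirmed here. The names `COMPONENTS`, `build_exclusion`, and `is_excluded` are made up for illustration, and the dot before each extension is escaped so that it only matches a literal dot.

```python
import re

# A few of the component patterns, with the dot before the extension escaped.
COMPONENTS = {
    "compressed": r"(?i:^.*\.(zip|rar|7z|tar|gz|gzip)$)",
    "images": r"(?i:^.*\.(jpg|jpeg|png|gif|bmp|tiff|psd|raw|cr2|nef|orf|sr2)$)",
    "web": r"(?i:^.*\.(js|html|css)$)",
}


def build_exclusion(*names):
    """Join the selected component patterns with OR (|)."""
    return "|".join(COMPONENTS[name] for name in names)


def is_excluded(pattern, path):
    # Assumption: the indexer uses search-style matching on the path.
    return re.search(pattern, path) is not None


pattern = build_exclusion("compressed", "images", "web")

for sample in ["books/novel.epub", "books/cover.JPG", "site/app.js", "backup.tar.gz"]:
    status = "excluded" if is_excluded(pattern, sample) else "kept"
    print(f"{sample} -> {status}")
```

With an unescaped dot, a path such as `notes/zip` would also be excluded, because `.` matches any character; the escaped form requires a real extension separator.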
ToDo
- Expand on Webfiles (py, cgi, etc.)
- Create a list of text files (txt, md, etc.)
- Add regex to exclude entire subdirectories (if possible)
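For the subdirectory item, and for dotfiles in paths that contain characters `\w` does not cover (hyphens, spaces, dots in folder names), one possible approach is to match on a path component rather than on the whole path. This is only a sketch tested with Python's `re`; the patterns `DOTFILES` and `PDF_DIR` are suggestions, not something taken from the sist2 documentation, and search-style matching is assumed.

```python
import re

# Hypothetical patterns, assuming search-style matching on the full path.

# Any path component that starts with a dot: .git/config, home/.bashrc
DOTFILES = r"(^|/)\.[^/]+"

# Anything inside a directory named exactly "pdf", at any depth.
PDF_DIR = r"(^|/)pdf/"


def excluded(pattern, path):
    return re.search(pattern, path) is not None


samples = [
    "home/.bashrc",
    "home/user/.config/app/settings.json",
    "docs/report.v2.pdf",
    "web/pdf/viewer.js",
    "books/manual.pdf",
]
for path in samples:
    print(path, excluded(DOTFILES, path), excluded(PDF_DIR, path))
```

Anchoring on `(^|/)` keeps `report.v2.pdf` and `manual.pdf` out of both exclusions, since the dot or the word `pdf` has to sit at the start of a path component.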
Questions
- Should this regex be more optimized?
- Will the "componentized" nature of it create higher overhead or be more process intensive?
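One rough way to get a feel for the overhead question is to time a componentized pattern against a single merged alternation on a batch of made-up paths. The sketch below uses Python's `re` and `timeit`, so the numbers say nothing definite about the indexer's own regex engine; the shortened extension lists and the generated paths are placeholders.

```python
import re
import timeit

# Three components joined with OR, versus one merged alternation.
componentized = re.compile(
    r"(?i:^.*\.(zip|rar|7z)$)|(?i:^.*\.(jpg|jpeg|png)$)|(?i:^.*\.(js|html|css)$)"
)
merged = re.compile(r"(?i:^.*\.(zip|rar|7z|jpg|jpeg|png|js|html|css)$)")

# Made-up sample paths: mostly files that should be kept, plus two matches.
paths = [f"library/author_{i}/book_{i}.epub" for i in range(1000)]
paths += ["site/app.js", "a/b.ZIP"]


def count_matches(regex):
    return sum(1 for p in paths if regex.search(p))


# Both forms should exclude exactly the same files.
assert count_matches(componentized) == count_matches(merged)

for name, regex in [("componentized", componentized), ("merged", merged)]:
    seconds = timeit.timeit(lambda: count_matches(regex), number=50)
    print(f"{name}: {seconds:.3f}s for 50 passes over {len(paths)} paths")
```

If the two timings are close on a realistic sample of your own paths, keeping the components separate for readability is probably the better trade.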
Came here on a whim to search open/closed issues for previous discussion about regex exclusions, formats etc. Thank you @rickcecil for the above, very grateful to you for putting these here as I was struggling to figure this out since I've never been able to get my brain to comprehend regex sufficiently to do it from scratch.
Awesome, glad it is helpful! I think the biggest thing now is just gathering all the different types of file extensions that you might want to exclude. For example, I realize that I've missed webm and webp. Also, log files. Post any extensions here that you come across that, when excluded, give a cleaner set of data... and I'll update the regex.
Is it possible to only include a certain filetype instead? E.g. only index all .eml files
Something like this maybe, which as an exclusion matches every path that does not end in .eml:
^(?!.*\.eml$).*$
But you would want to test it out on some file names (copy the entire path and the file name). I like to use this site to test regex https://regex101.com/
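The same kind of check can be scripted. Here is a small Python sketch that runs a negative-lookahead pattern anchored on the `.eml` extension against a few invented paths. It assumes search-style matching and adds a case-insensitive flag, which is my addition; it does not check whether the indexer also applies the pattern to directory names, which would matter for an "exclude everything else" pattern.

```python
import re

# Exclusion pattern: matches every path that does NOT end in .eml
# (the leading (?i) makes the extension check case-insensitive).
ONLY_EML = r"(?i)^(?!.*\.eml$).*$"


def excluded(path):
    return re.search(ONLY_EML, path) is not None


samples = [
    "mail/inbox/msg1.eml",
    "mail/inbox/MSG2.EML",
    "mail/inbox/photo.jpg",
    "mail/eml_backup/notes.txt",
]
for path in samples:
    print(path, "->", "excluded" if excluded(path) else "kept")
```

The last sample shows why the lookahead is anchored with `\.eml$`: a pattern that only looks for `eml` anywhere in the path would keep `mail/eml_backup/notes.txt` as well.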