suricata
suricata copied to clipboard
Rust mimetype
This patchset proposes an alternative to filemagic as it has multiple issues:
- filemagic is known for bad performance
- database is not consistent between system
- results are sometime too detailed (do we care about size of image)
This patchet uses a rust crate to extract the mime type instead of the magic:
- it is really faster than filemagic
- mimetype database is included in the crate
- mimetype are easier to predict
It thus fixes most of the issues present with filemagic and thus is a nice alternative.
- [x] I have read the contributing guide lines at https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Contributing
- [x] I have signed the Open Information Security Foundation contribution agreement at https://suricata.io/about/contribution-agreement/
- [x] I have updated the user guide (in doc/userguide/) to reflect the changes made (if applicable)
Link to redmine ticket:
Describe changes:
- add mimetype computation
- add mimetype to fileinfo
- add file.mimetype keyword
WARNING:
field | baseline | test | % |
---|---|---|---|
build_asan |
Pipeline 8992
The crate repo shows limited signs of activity, with the last commit about a year ago? https://github.com/mbrubeck/tree_magic/ @jasonish any thoughts on the crate?
Moved to draft as it needs some serious patching up as CI shows.
Is the database included, from the tree_magic readme:
By default, tree_magic_mini will attempt to load the shared MIME info database from the standard locations at runtime.
Embedding the database requires the tree_magic_db
feature, but the database is GPL and warned about in the README. Without embedding it, we still have the issue of inconsistency, or the magic file not even existing on a system.
The performance increase would be nice.
I've always thought we should roll this ourselves, with a limit magic database. I think Snort does this. Even with this crate, we'd probably need to roll our own database due to the GPL.
Just some followup... In the current version of this PR I verified that the embedded/bundled magic database is not being used. Instead if checks the following paths:
"/usr/share/mime/magic",
"/usr/local/share/mime/magic",
"/$HOME/.local/share/mime/magic",
If it can't load magic from those file, it does have some builtin's.. For example, it will detect a plain text file as text/plain
, or a PDF as application/octet-stream
, but with magic databases it would use the pdf mimetype.
The questionable files are the magic database, which come from freedesktop.org shared-mime-info project which is licensed with the GPLv2+, https://freedesktop.org/wiki/Software/shared-mime-info/, but being data files and not code, I'm not sure what this means.
If we were to include these data files, but not embed them in the binary, then we would need to modify tree_magic_mini
to support loading the database from arbitrary locations, as it currently uses a hardcoded list.
I think there is some tricky licensing details with mime databases here. https://github.com/zRedShift/mimemagic/ is a Go project for mime-sniffing. That project parsed the raw XML data from shared-mime-info
and created Go files which it then links in, then was subject to a DMCA takedown notice by one of the contributors to shared-mime-info
: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues/154.
I think enabling the embed feature, while extremely useful, is probably not a good idea. Shipping the data as files, is also questionable, being GPL code we don't have control over.
We could do ./configure and run time startup checks to see if shared-mime-info
is installed and display warnings.
Or maybe we should look at something without any license issues. I found https://github.com/bojand/infer which supports a more limited set of file types. Its MIT license, no extra files, its all in the code.
Utlmately I think we want something like this PR suggests. I'd be favour of removing magic if we had a consistent mime type sniffer across all platforms we support and did not rely on the runtime environment for data files.
Should we make the build only if an enable-flag is passed ? This would avoid the GPL issue.
Should we make the build only if an enable-flag is passed ? This would avoid the GPL issue.
I'm not sure I like the idea of something I see as an obvious enhancement over an existing feature being gated by a license flag.
Ooops, didn't mean to close.
Just to give you an idea on the impact, we did some performance testing with that feature compared to file-magic:
34 Gbit/s Traffic on 40G with baseline t-rex traffic an 4.19 kernel after 10min running:
force-mimetype: false
and file-magic: false
4.0 % Drop
force-mimetype: true
and file-magic: false
4.1 % Drop
force-mimetype: false
and file-magic: true
43 % Drop
So for setups that use this a lot it could be a big improvement.
Just to give you an idea on the impact, we did some performance testing with that feature compared to file-magic:
34 Gbit/s Traffic on 40G with baseline t-rex traffic an 4.19 kernel after 10min running:
force-mimetype: false
andfile-magic: false
4.0 % Drop
force-mimetype: true
andfile-magic: false
4.1 % Drop
force-mimetype: false
andfile-magic: true
43 % DropSo for setups that use this a lot it could be a big improvement.
Nice performance numbers. I think in the not to long term it would be good to drop the magic support and just look at the mime-type, however I'm not sure we can proceed with this library given the license restrictions. As it would be nice to provide a "consistent" mime database across all installs, and not depend on what the user has installed.