suricata icon indicating copy to clipboard operation
suricata copied to clipboard

Rust mimetype

Open regit opened this issue 1 year ago • 6 comments

This patchset proposes an alternative to filemagic as it has multiple issues:

  • filemagic is known for bad performance
  • database is not consistent between system
  • results are sometime too detailed (do we care about size of image)

This patchet uses a rust crate to extract the mime type instead of the magic:

  • it is really faster than filemagic
  • mimetype database is included in the crate
  • mimetype are easier to predict

It thus fixes most of the issues present with filemagic and thus is a nice alternative.

  • [x] I have read the contributing guide lines at https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Contributing
  • [x] I have signed the Open Information Security Foundation contribution agreement at https://suricata.io/about/contribution-agreement/
  • [x] I have updated the user guide (in doc/userguide/) to reflect the changes made (if applicable)

Link to redmine ticket:

Describe changes:

  • add mimetype computation
  • add mimetype to fileinfo
  • add file.mimetype keyword

regit avatar Sep 02 '22 19:09 regit

WARNING:

field baseline test %
build_asan

Pipeline 8992

suricata-qa avatar Sep 02 '22 20:09 suricata-qa

The crate repo shows limited signs of activity, with the last commit about a year ago? https://github.com/mbrubeck/tree_magic/ @jasonish any thoughts on the crate?

victorjulien avatar Sep 09 '22 09:09 victorjulien

Moved to draft as it needs some serious patching up as CI shows.

victorjulien avatar Sep 09 '22 09:09 victorjulien

Is the database included, from the tree_magic readme:

By default, tree_magic_mini will attempt to load the shared MIME info database from the standard locations at runtime.

Embedding the database requires the tree_magic_db feature, but the database is GPL and warned about in the README. Without embedding it, we still have the issue of inconsistency, or the magic file not even existing on a system.

The performance increase would be nice.

I've always thought we should roll this ourselves, with a limit magic database. I think Snort does this. Even with this crate, we'd probably need to roll our own database due to the GPL.

jasonish avatar Sep 09 '22 15:09 jasonish

Just some followup... In the current version of this PR I verified that the embedded/bundled magic database is not being used. Instead if checks the following paths:

"/usr/share/mime/magic",
"/usr/local/share/mime/magic",
"/$HOME/.local/share/mime/magic",

If it can't load magic from those file, it does have some builtin's.. For example, it will detect a plain text file as text/plain, or a PDF as application/octet-stream, but with magic databases it would use the pdf mimetype.

The questionable files are the magic database, which come from freedesktop.org shared-mime-info project which is licensed with the GPLv2+, https://freedesktop.org/wiki/Software/shared-mime-info/, but being data files and not code, I'm not sure what this means.

If we were to include these data files, but not embed them in the binary, then we would need to modify tree_magic_mini to support loading the database from arbitrary locations, as it currently uses a hardcoded list.

jasonish avatar Sep 12 '22 17:09 jasonish

I think there is some tricky licensing details with mime databases here. https://github.com/zRedShift/mimemagic/ is a Go project for mime-sniffing. That project parsed the raw XML data from shared-mime-info and created Go files which it then links in, then was subject to a DMCA takedown notice by one of the contributors to shared-mime-info: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues/154.

I think enabling the embed feature, while extremely useful, is probably not a good idea. Shipping the data as files, is also questionable, being GPL code we don't have control over.

We could do ./configure and run time startup checks to see if shared-mime-info is installed and display warnings.

Or maybe we should look at something without any license issues. I found https://github.com/bojand/infer which supports a more limited set of file types. Its MIT license, no extra files, its all in the code.

Utlmately I think we want something like this PR suggests. I'd be favour of removing magic if we had a consistent mime type sniffer across all platforms we support and did not rely on the runtime environment for data files.

jasonish avatar Sep 12 '22 19:09 jasonish

Should we make the build only if an enable-flag is passed ? This would avoid the GPL issue.

regit avatar Oct 03 '22 13:10 regit

Should we make the build only if an enable-flag is passed ? This would avoid the GPL issue.

I'm not sure I like the idea of something I see as an obvious enhancement over an existing feature being gated by a license flag.

jasonish avatar Oct 03 '22 14:10 jasonish

Ooops, didn't mean to close.

jasonish avatar Oct 03 '22 14:10 jasonish

Just to give you an idea on the impact, we did some performance testing with that feature compared to file-magic:

34 Gbit/s Traffic on 40G with baseline t-rex traffic an 4.19 kernel after 10min running:

force-mimetype: false and file-magic: false 4.0 % Drop

force-mimetype: true and file-magic: false 4.1 % Drop

force-mimetype: false and file-magic: true 43 % Drop

So for setups that use this a lot it could be a big improvement.

norg avatar Oct 26 '22 14:10 norg

Just to give you an idea on the impact, we did some performance testing with that feature compared to file-magic:

34 Gbit/s Traffic on 40G with baseline t-rex traffic an 4.19 kernel after 10min running:

force-mimetype: false and file-magic: false 4.0 % Drop

force-mimetype: true and file-magic: false 4.1 % Drop

force-mimetype: false and file-magic: true 43 % Drop

So for setups that use this a lot it could be a big improvement.

Nice performance numbers. I think in the not to long term it would be good to drop the magic support and just look at the mime-type, however I'm not sure we can proceed with this library given the license restrictions. As it would be nice to provide a "consistent" mime database across all installs, and not depend on what the user has installed.

jasonish avatar Oct 26 '22 18:10 jasonish