More robust MIME type guessing
Feature Request
- Is your feature request related to a problem? Please describe clearly and concisely what it is.
The MIME module currently provides various methods for resolving a file's MIME type. However, they all simply look up the file's extension in the database. This can't handle files without an extension, or files whose extension is not representative of their contents.
- Describe the feature you would like, optionally illustrated by examples, and how it will solve the above problem.
I propose that the MIME component add a new method, .guess_mime_type or something along those lines, to handle this use case. The implementation would be pretty simple via libmagic.
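To give a rough idea of how small the libmagic route could be, here is a minimal sketch of a binding. The `LibMagic` lib and the `guess_mime_type` helper are names made up for this illustration, not an existing stdlib or shard API, and the sketch assumes libmagic and its default database are installed on the system.

```crystal
# Minimal binding to libmagic's C API (assumes libmagic is installed).
@[Link("magic")]
lib LibMagic
  alias Cookie = Void*

  # MAGIC_MIME_TYPE from magic.h: report a MIME type instead of a description.
  MIME_TYPE = 0x0000010

  fun open = magic_open(flags : LibC::Int) : Cookie
  fun load = magic_load(cookie : Cookie, filename : LibC::Char*) : LibC::Int
  fun file = magic_file(cookie : Cookie, filename : LibC::Char*) : LibC::Char*
  fun close = magic_close(cookie : Cookie) : Void
end

# Hypothetical helper; the name only mirrors the proposal above.
# Returns nil if libmagic could not be initialized or the file could not be inspected.
def guess_mime_type(path : String) : String?
  cookie = LibMagic.open(LibMagic::MIME_TYPE)
  return nil if cookie.null?

  begin
    # A null filename asks libmagic to load its default database.
    return nil unless LibMagic.load(cookie, Pointer(LibC::Char).null) == 0

    result = LibMagic.file(cookie, path)
    # Copy the C string now: the cookie owns it and is closed in `ensure`.
    result.null? ? nil : String.new(result)
  ensure
    LibMagic.close(cookie)
  end
end
```

Opening and closing a cookie per call keeps the sketch short; a real implementation would probably reuse one cookie, since loading the database is the expensive part.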
- Describe considered alternative solutions, and the reasons why you have not proposed them as a solution here.
The alternative here is to not bake this into the stdlib and continue to handle it via external shards. This is of course a totally valid approach, but built-in support would be a nice addition given its simplicity and that MIME already includes some level of this feature.
- Does it break backward compatibility, if yes then what's the migration path?
It does involve a new system lib dependency that would need to be taken into consideration, but it's likely widely available enough that it won't be an issue. Tho I'm unsure what the situation is on Windows...
If we so desire, we could also make it an opt-in dependency via require "mime/guesser" or something like that. I.e. if you don't require the file manually, nothing would be using libmagic, so it wouldn't be required.
EDIT: Tho mime itself is already opt-in so maybe that'll be enough.
It would probably be a good idea to start building a good shard for this. And then we can consider incorporating it into stdlib.
A dependency-free alternative to wrapping libmagic would be a native implementation, possibly using the data from libmagic.
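As a rough sketch of what a native implementation could look like, the snippet below matches a file's leading bytes against a small table of well-known signatures. The `MAGIC_SIGNATURES` table and `sniff_mime_type` method are hypothetical names for illustration; a real implementation would need a far larger, generated table and support for offsets and masks, which this ignores.

```crystal
# Hypothetical signature table: leading "magic" bytes paired with a MIME type.
MAGIC_SIGNATURES = [
  {Bytes[0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A], "image/png"},
  {Bytes[0xFF, 0xD8, 0xFF], "image/jpeg"},
  {Bytes[0x47, 0x49, 0x46, 0x38], "image/gif"},
  {Bytes[0x25, 0x50, 0x44, 0x46, 0x2D], "application/pdf"},
  {Bytes[0x50, 0x4B, 0x03, 0x04], "application/zip"},
]

# Reads up to 16 bytes from the IO and returns the first matching type, or nil.
# Note: a single `read` may return fewer bytes than requested on some IOs
# (e.g. sockets); that is good enough for files and in-memory data here.
def sniff_mime_type(io : IO) : String?
  buffer = Bytes.new(16)
  read = io.read(buffer)
  header = buffer[0, read]

  MAGIC_SIGNATURES.each do |(signature, mime)|
    next if header.size < signature.size
    return mime if header[0, signature.size] == signature
  end

  nil
end

# Works on any IO, so the file name and extension never come into play.
sniff_mime_type(IO::Memory.new("%PDF-1.7\n")) # => "application/pdf"
```

The extension-based lookup could then remain the fallback for content this kind of sniffing can't identify, such as plain text formats.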
Alright, I can integrate something into Athena's MIME component to start. Not going to be the exact same API, but should be pretty close.
I also just realized that none of these files exist on my machine:
https://github.com/crystal-lang/crystal/blob/54022594f84040c976634863ce5fac1b31a68048/src/crystal/system/unix/mime.cr#L2-L12
So it's only ever been using the defaults we define. It seems /etc/mime.types is part of the mailcap package. Ideally it could make use of shared-mime-info instead, as that seems a lot more robust, but it uses a different format so would require some extra work.
Maybe at least we should print a warning or something if you don't have any?
EDIT: Seems PHP's Symfony just generates a big hash of these mappings. That has the benefit of always working w/o needing to read databases at runtime. The cost of course is that it needs to be manually refreshed every now and then.
EDIT2: The other con is that's like another few 100K added to binary size :/
Yeah, such a mime database is usually not installed on a base system unless you have some additional package.
It wouldn't be a bad idea to offer a way to bake the information into the executable. This could be optional if we want to avoid the memory implication. However it only matters if you use MIME and for most applications the added size won't make a difference.
NOTE: shared-mime-info is licensed under GPL, so you can't bundle it.
Put together https://github.com/athena-framework/athena/pull/534 to handle this for my needs. Windows seems to be the more annoying part to deal with. Would have to look into building/distributing a dll for it, but for now unix and MSYS2 are good enough.
It embeds a big hash for the MIME type <=> file extension lookups based on shared-mime-info data. It's not bundling their actual code, so it's fine licensing-wise. Happy to extract some of it into stdlib if we'd like. Otherwise it will be available in the next release of that component.
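For anyone wondering what an embedded two-way table might look like in principle, here is a tiny sketch. The `BakedTypes` module, its method names, and the four entries are invented for illustration and are not the API or data from the PR above; a real table would be generated from a database and be much larger.

```crystal
module BakedTypes
  # Hypothetical excerpt of an extension => MIME type table compiled into the binary.
  EXT_TO_TYPE = {
    "png"  => "image/png",
    "jpg"  => "image/jpeg",
    "jpeg" => "image/jpeg",
    "pdf"  => "application/pdf",
  }

  # Reverse index (MIME type => extensions), derived from the table above
  # so the two directions can't drift apart.
  TYPE_TO_EXTS = begin
    index = Hash(String, Array(String)).new
    EXT_TO_TYPE.each do |ext, mime|
      (index[mime] ||= [] of String) << ext
    end
    index
  end

  # Looks up the type by extension only; no system database is read at runtime.
  def self.mime_type?(filename : String) : String?
    ext = File.extname(filename).lchop('.').downcase
    EXT_TO_TYPE[ext]?
  end

  # Returns every known extension for a type, or an empty array.
  def self.extensions(mime : String) : Array(String)
    TYPE_TO_EXTS[mime]? || [] of String
  end
end

BakedTypes.mime_type?("photo.JPG")   # => "image/jpeg"
BakedTypes.extensions("image/jpeg")  # => ["jpg", "jpeg"]
```

Deriving the reverse index at startup trades a little initialization time for only having to maintain one generated table.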
@Blacksmoke16 Any plans to make this a shard maybe? I'd be very interested in something similar and also help with it.
@dup2 It already is yea! https://athenaframework.org/MIME/Types/#Athena::MIME::Types was added as part of the MIME component's v0.2.0 release.