Content Type set by HTTP Gateway
The HTTP Gateway sniffs the content type based on golang.org/src/net/http/sniff.go and the file extension. js-ipfs uses a similar setup.
Problem: there is no mechanism for a website creator to override the returned content-type, and setting a custom file extension works only for some file types.
Example
The same data produces a different content-type depending on the request path.
SVG image
https://ipfs.io/ipfs/QmVdFJJBiQkVKFcvXu4WzySbZ7KnCW6uGWLJqZz5FnRWjk/ipfs-logo.svg
→ returned as image/svg+xml
XML document
https://ipfs.io/ipfs/QmVdFJJBiQkVKFcvXu4WzySbZ7KnCW6uGWLJqZz5FnRWjk/ipfs-logo.xml
→ returned as text/xml
Unknown extension
https://ipfs.io/ipfs/QmVdFJJBiQkVKFcvXu4WzySbZ7KnCW6uGWLJqZz5FnRWjk/ipfs-logo.foo
→ returned as text/plain
Raw CID
https://ipfs.io/ipfs/QmTqZhR6f7jzdhLgPArDPnsbZpvvgxzCZycXK7ywkLxSyU
→ returned as text/plain
Raw CID + explicit filename
https://ipfs.io/ipfs/QmTqZhR6f7jzdhLgPArDPnsbZpvvgxzCZycXK7ywkLxSyU?filename=/ipfs-logo.svg
→ returned as image/svg+xml
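The behavior in these examples can be approximated with a short Go sketch. It is not the gateway's actual code: guessContentType is a hypothetical helper that assumes the extension is tried first and the payload is sniffed only as a fallback. The exact strings returned for known extensions depend on the MIME table available on the host.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"path"
)

// guessContentType is a hypothetical helper, not the gateway's real code.
// It prefers the file extension and falls back to sniffing the first bytes
// of the payload with the net/http sniffer.
func guessContentType(name string, data []byte) string {
	if ext := path.Ext(name); ext != "" {
		if ctype := mime.TypeByExtension(ext); ctype != "" {
			return ctype
		}
	}
	// Unknown or missing extension: only the bytes are left to look at.
	return http.DetectContentType(data)
}

func main() {
	svg := []byte(`<svg xmlns="http://www.w3.org/2000/svg"></svg>`)
	for _, name := range []string{"ipfs-logo.svg", "ipfs-logo.xml", "ipfs-logo.foo", ""} {
		fmt.Printf("%q -> %s\n", name, guessContentType(name, svg))
	}
}
```

The sniffer has no SVG signature, so an SVG without an XML prolog and without a recognized extension ends up as plain text, which matches the raw CID case above.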
Motivation
We want IPFS to become a viable solution for hosting websites. At the HTTP level, as a bare minimum, website owners expect to be able to override:
- content-type of specific files / file types
- error pages (4xx, 5xx)
Ideas to explore
(A) Embedding content-type in DAG-PB (UnixFS metadata)
One way to address this is to support embedding Content-Type in UnixFS DAG metadata.
It would be opt-in (like mode and mtime).
TBD if filename should override content type embedded in the dag.
This is tracked in https://github.com/ipfs/specs/issues/364
(B) Drop-in config to override content-type per directory
@warpfork noted that DAG metadata may not be the best place for storing content-type:
https://github.com/ipfs/specs/issues/217#issuecomment-527198592 +1 towards the idea that if [Content] type is getting well-known support, it should be something we move towards the gateway knowing of it, rather than making it a feature of the filesystem.
This would be a much closer set of relationships to how the rest of the world works already (e.g. doing sysadmin today with nginx or something, I would generally configure [Content] types at the webserver area, and not in filesystem metadata) -- and thus seems much less likely to go awry.
Carefully avoiding baking in the idea of a single "mimetype string" field into our filesystem metadata also leaves much more room for issues to evolve around the things Ian mentioned:
- a file can have multiple mime types depending on the context
- some mime types can't be deduced until the entire file has been read
My take on this is:
- mind, we did exactly the opposite with mtime and mode – UnixFS 1.5 embeds them in dag-pb
- we could support both ways. e.g., website creator would add something like _headers to the directory, and Gateway would do the right thing when a resource from the directory or its subdirectories is requested
  - presence of the config file would disable content sniffing on both server and client (X-Content-Type-Options: nosniff)
See _headers in https://github.com/ipfs/specs/issues/257
References
- Storing Explicit Content Type: https://github.com/ipfs/unixfs-v2/issues/11
- SVG files being sniffed incorrectly (https://github.com/ipfs/faq/issues/224#issuecomment-278156252, https://github.com/ipfs/js-ipfs-http-response/pull/5, https://github.com/ipfs/js-ipfs/pull/1482)
- prior art for drop-in config that travels with data
cc @olizilla @autonome
* prior art: `.htaccess`, `.gitattributes`
Wouldn't this mean that every request to the gateway becomes two requests (one for the actual content, the other to figure out if an .htaccess-clone exists)? This may be expensive.
And if using different extensions on the filename is effectively setting the content type guessed for that file, isn't this precisely a way to hint/override the content type of certain content?
Wouldn't this mean that every request to the gateway becomes two requests (one for the actual content, the other to figure out if an .htaccess-clone exists)? This may be expensive.
It looks that way, however (iiuc) if gateway wants to resolve /ipfs/{cid}/foo/bar/cat.xyz to a CID it needs to fetch and cache dag roots of /ipfs/{cid}/, /ipfs/{cid}/foo/ and /ipfs/{cid}/foo/bar/.
This means checking if .ipfs exists in any of them does not trigger additional fetch: dag with directory listing is already cached in local repo, which should be cheap to check by the gateway.
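To make the argument concrete, here is a small Go sketch of that lookup. The dirs type is a stand-in for directory listings that are already cached after path resolution, nearestConfig is a hypothetical helper, and letting the deepest directory win is an assumption, not something decided in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// dirs stands in for directory nodes cached while resolving a path:
// directory path -> names of its entries.
type dirs map[string][]string

// nearestConfig walks from root towards the requested file and returns the
// deepest directory on that path that contains configName. It only reads
// listings that path resolution had to load anyway.
func nearestConfig(cached dirs, root, reqPath, configName string) (string, bool) {
	segments := strings.Split(strings.Trim(reqPath, "/"), "/")
	dir, found, ok := root, "", false
	for i := 0; ; i++ {
		for _, name := range cached[dir] {
			if name == configName {
				found, ok = dir, true
			}
		}
		// The last segment is the file itself, not a directory.
		if i >= len(segments)-1 {
			break
		}
		dir = dir + "/" + segments[i]
	}
	return found, ok
}

func main() {
	cached := dirs{
		"/ipfs/{cid}":         {".ipfs", "foo"},
		"/ipfs/{cid}/foo":     {"bar"},
		"/ipfs/{cid}/foo/bar": {"cat.xyz"},
	}
	dir, ok := nearestConfig(cached, "/ipfs/{cid}", "foo/bar/cat.xyz", ".ipfs")
	fmt.Println(dir, ok)
}
```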
if using different extensions on the filename is effectively setting the content type guessed for that file, isn't this precisely a way to hint/override the content type of certain content?
Unfortunately extension-based sniffing relies on arbitrary mapping hardcoded in go-ipfs and works only for popular file types, such as SVG. Publishing file with .sxg extension did not set correct content-type (example below).
Real life example: .sxg
Signed HTTP Exchanges (https://github.com/ipfs/in-web-browsers/issues/121) are bundled as .sxg files. Chrome won't load them unless .sxg is returned with specific content-type (at the moment it is application/signed-exchange;v=b3). Right now ipfs.io has a special Nginx rule that overrides content-type for .sxg, but this obviously does not scale well, and will break old snapshots when we globally update to a new version. On top of that, future specs add more content types.
It is a good illustration of a use case where a person publishing a file would want to override the content-type of a specific file locally and ensure every gateway returns a valid one.
Just FYI, there is an accepted proposal https://github.com/ipfs/go-ipfs/issues/6214 for support of .ipfs-gateway.(json|yaml). Let's see how the implementation moves along.
Has there been much progress on having a 404 page for IPFS-hosted websites?
I believe _redirects is work-in-progress, and _headers will be next – see recent status update in https://github.com/ipfs/specs/issues/257#issuecomment-1077484817
When we have that, we may allow customizing Content-Type header via _headers file (tbd, needs security analysis).
An alternative idea is to do what we did for opt-in mtime and mode and allow opt-in mtype as part of dag-pb.
Looking for early feedback in https://github.com/ipfs/specs/issues/364 (no IPIP yet).
What if I want to serve media from an IPFS gateway for my website, but I do not want to allow application/javascript in the content-type header? We need the ability to control allowed types too, not just make detection good.
Good news, the default IPFS gateway returns Content-Type: text/plain; charset=utf-8 for javascript files, even if they end with a .js extension.
EDIT: Same isn't true for svg, though. https://github.com/allanlw/svg-cheatsheet
EDIT2: svg exploits don't work when used in an img tag, so it's okay.
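For the allow-list question raised above, a gateway operator could filter the detected type before it is sent. A minimal Go sketch, assuming a hypothetical restrict helper and a downgrade to application/octet-stream for anything not on the list (neither is an existing gateway feature):

```go
package main

import (
	"fmt"
	"mime"
)

// restrict returns ctype unchanged when its media type is on the allow-list,
// and a generic binary type otherwise. Parameters such as charset are ignored
// for the comparison.
func restrict(ctype string, allowed map[string]bool) string {
	base, _, err := mime.ParseMediaType(ctype)
	if err != nil || !allowed[base] {
		return "application/octet-stream"
	}
	return ctype
}

func main() {
	allowed := map[string]bool{"image/png": true, "text/plain": true}
	for _, ctype := range []string{"image/png", "text/plain; charset=utf-8", "application/javascript"} {
		fmt.Printf("%s -> %s\n", ctype, restrict(ctype, allowed))
	}
}
```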