multicodec
feat: assign codes for MIME types
And add a script to automate this.
fixes #4
Questions:
- [ ] Names: Currently, I've left the mime types as names. We might want to use mime/ as in #84.
- [ ] Range: I've allocated quite a large range (0.4% of the 4 byte range). Most of this is reserved.
I think that we should factor out the consumption of the entire multiformat table in our multiformat clients before we drastically increase the size of the table. The parse time and table size are already noticeable in our bundle size, and this would make that much worse. My older http clients include a full mime database, which makes them unsuitable for browser bundles. Luckily, we already have a tentative plan to move to using integer references that won't require the full table in code.
The import script is not future-proof. If I understand it correctly, then the script parses the XML file and assigns numbers to the codecs based on the order in the XML file. They are sorted alphabetically in the XML file, so if a new mimetype is added and you re-run the script, you would end up with different codes.
Unless there's a bug, it first loads the already-assigned numbers. Then, for all new mime types, it assigns increasing (unique) codes.
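For readers following along, the behaviour described above could look roughly like the sketch below. This is not the actual import script: the function name `assign_codes`, the `range_start` parameter, and the example codes are all hypothetical, and the sketch assumes `existing` only holds entries from the MIME range.

```python
def assign_codes(existing: dict[str, int], names: list[str], range_start: int) -> dict[str, int]:
    """Keep every already-assigned code and give new names the next free codes."""
    table = dict(existing)  # existing assignments are never changed
    # Continue after the highest assigned code, or start at range_start if none exist.
    next_code = max([range_start, *(code + 1 for code in table.values())])
    for name in names:
        if name not in table:
            table[name] = next_code
            next_code += 1
    return table


# Hypothetical codes, for illustration only.
existing = {"text/plain": 0x200000, "image/png": 0x200001}
names = ["application/json", "image/png", "text/plain"]  # alphabetical, as in the XML
table = assign_codes(existing, names, range_start=0x200000)
```

With this approach a newly added type sorts early alphabetically but still receives a code after all previously assigned ones, so re-running does not renumber anything.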
Some records have a date attribute, which looks like it was introduced in 2014. Hence I propose sorting the records by date attribute first (the ones without a date first) and, if they have the same date, alphabetically. This way we hopefully end up with a future-proof, reproducible assignment of codes.
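A minimal sketch of that ordering, assuming each record has already been parsed into a `(name, date)` tuple where `date` is an ISO `YYYY-MM-DD` string or `None`. The tuple layout, the function name `stable_order`, and the sample records are made up for illustration.

```python
from typing import Optional

Record = tuple[str, Optional[str]]  # (mime type name, ISO date or None)


def stable_order(records: list[Record]) -> list[Record]:
    """Sort undated records first, then by date, then alphabetically by name."""
    # An empty string sorts before any ISO date, so undated records come first.
    return sorted(records, key=lambda record: (record[1] or "", record[0]))


# Made-up records, deliberately out of order.
records = [
    ("text/b", "2015-01-01"),
    ("text/a", None),
    ("text/c", "2014-06-01"),
    ("text/aa", "2015-01-01"),
]
ordered = stable_order(records)
```

Because the result depends only on the record contents and not on their position in the input, numbering the sorted list from a fixed starting code gives the same assignment on every run. That only stays true as long as every record added later carries a date.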
Sounds like a good starting point. I'll keep the current "don't change things" logic but having a stable conversion would be nice.
@mikeal Hm. Fair point. Even compressed, the table is going to grow from 3K to 22K. Or 26K to 260K uncompressed.
I'm fine leaving this in a PR for now. I submitted it because we kept getting requests to do something like this.
> Unless there's a bug, it first loads the already-assigned numbers. Then, for all new mime types, it assigns increasing (unique) codes.
Oh I missed that when I read the code. Though I'm still in favour of having a stable conversion that can be run at any time and results in the same output.
I've put making this stable on my todo list. I don't know when I will find the time to do it. If merging this becomes urgent, please let me know and I'll prioritize it.
A downside of a 4-byte range is that it makes a base32 sha256 CID 64 characters long, which doesn't fit in a DNS segment (a DNS label is limited to 63 characters). A 3-byte range would work, and with a single range instead of sub-ranges for each major type it wouldn't reserve too much of that space.
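The length arithmetic can be checked with a small sketch. It assumes a CIDv1 laid out as version varint, codec varint, then a sha2-256 multihash, encoded as lowercase unpadded base32 with the `b` multibase prefix. The helper names are hypothetical, and the codec values are placeholders picked only for how many bytes their varint encoding takes.

```python
import base64
import hashlib

DNS_LABEL_MAX = 63  # maximum length of a single DNS label


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cidv1_base32(codec: int, payload: bytes) -> str:
    """Build a base32 CIDv1 string for payload using a sha2-256 multihash."""
    digest = hashlib.sha256(payload).digest()
    multihash = bytes([0x12, 0x20]) + digest  # sha2-256 code, 32-byte digest length
    raw = encode_varint(1) + encode_varint(codec) + multihash
    return "b" + base64.b32encode(raw).decode("ascii").rstrip("=").lower()


# Placeholder codec values chosen by varint length, not real assignments
# (0x55 is the existing raw codec).
lengths = {
    "1-byte codec": len(cidv1_base32(0x55, b"hello")),
    "3-byte codec": len(cidv1_base32(0x4000, b"hello")),
    "4-byte codec": len(cidv1_base32(0x200000, b"hello")),
}
```

Under these assumptions the lengths come out as 59, 62 and 64 characters, so only the 4-byte case exceeds `DNS_LABEL_MAX`.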
What's the status of this? I'm hoping to use CIDs to refer to image types soon.
I have a suggestion: instead of one big table.csv, perhaps we could have multiple tables and use certain codes to descend into separate address spaces.
Also, I noticed that image/jpeg is missing from the table. That is a very strange MIME type not to have in the csv. The same goes for image/gif.
Another question: adding MIME types overlaps with existing codecs in table.csv. For example, how do we compare application/json to json, which already exists in table.csv?
Looks like this effort has been stalled for a while, mostly due to concerns around the drastic increase in table size?
The readme of the project describes a first-come, first-serve policy when it comes to adding new codecs, and I wonder if we could maybe apply that here as well with mime types. I.e. maybe we can start with a small handful of the most commonly used MIME types on the internet today (say, this list), and then add more over time based on demand, instead of dumping in all known mime-types at once?
Is there some particular need for all the mime types to be in a contiguous block that I'm not aware of?