Cache schema / channel IDs in Python Writer
Changelog
Add caching to register_schema and register_channel methods on the Python Writer class.
Docs
I don't know that there is an issue for this. Probably there should be one made since this is a behavioural change.
Description
The Rust crate for mcap has this handy feature where channel and schema IDs will be cached via a map when "registered" or added to a writer. What this means in practice is that if you add the same schema contents (or conversely channel contents) multiple times to a writer, then you will always get the same schema ID / channel ID back (and no record will be written, since the original record was already written to the MCAP).
This is useful, as you seldom want to write the same channel information across multiple IDs. If you don't have this feature, you have to cache this data yourself. Unfortunately, while Rust does the "right" thing (right being defined as according to "what did I expect to happen"), the Python library does not. It forces users to cache the IDs in their own application.
For the purpose of this change, I more or less just cached based off of everything passed to either the register_schema or register_channel functions. The intent was to be analogous to the behaviour in the Rust crate, but in a Pythonic way.
Note that I couldn't just use functools.cache because these were methods which accepted self. As such I needed to make a small wrapper. The wrapper itself is something you can probably find across dozens of stackoverflow answers, e.g.:
https://stackoverflow.com/a/68052994
That being said, I think the change is relatively straightforward and so it would be hard to modify this or rewrite it from scratch in any meaningful way.
Ultimately, this is a behavioural breaking change. Users who are already tracking IDs for schemas and channels will probably notice that they are doing so needlessly if this is pushed. Users that aren't will notice that they are no longer writing the same channel / schema records multiple times. I mostly figured that the other mcap libraries in other languages default to this behaviour, so it was a bit odd that it wasn't present here; however, I could see how making a breaking change like this could be annoying, since you can't choose not to cache with this commit applied.
I would say that if that is the case though, then you are probably doing something a bit outside of the norm. 🤔 I don't know if I can try to argue for such a scenario, because it seems strange to want more channel records that would otherwise cause issues.
NOTE: I noticed that cached IDs was a problem after I did not cache them myself in a python script, and resulted in running over the limit for total number of schemas / channel IDs (which is u16::MAX). 😬 So this change also helps you avoid exhausting your schema / channel IDs by doing something stupid.
I added the unit test test_schema_channel_caching which utilizes generate_mcap_schemas_channels (also new, although a bit copy-and-pasted) to validate this. Seems to work as expected on my end.
| Before | After |
|---|---|
|
I don't know how to display this but prior to this change you would have to manage schema / channel IDs externally lest you write the schema and channel to the MCAP file multiple times. |
|