rust: Add support for static messages
Changelog
- rust: Add support for logging static messages
Description
Today, it's necessary to republish static or slow-changing data at a fairly high frequency to ensure that it is available to a live viz connection (or mcap file, for that matter) that joins the stream at some arbitrary point.
Instead, we can offer the ability to log "static" messages which are propagated to sinks in the usual way, but a copy is also stored on the channel. When a new sink subscribes to the channel, it receives a copy of the static message immediately before receiving any other messages from the channel.
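A minimal sketch of that mechanism (the `Channel`, `Sink`, and `Message` types here are illustrative, not the actual SDK API): the channel retains a copy of its one static message and replays it to each newly attached sink before any live messages.

```rust
// Hypothetical sketch, not the real SDK types: a channel that keeps a
// copy of its latest static message and replays it to new sinks.
#[derive(Clone)]
struct Message {
    log_time_ns: u64, // original log time is preserved on replay
    payload: Vec<u8>,
}

trait Sink {
    fn receive(&mut self, msg: &Message);
}

struct Channel {
    sinks: Vec<Box<dyn Sink>>,
    latched: Option<Message>, // at most one static message per channel
}

impl Channel {
    fn log(&mut self, msg: Message) {
        for sink in &mut self.sinks {
            sink.receive(&msg);
        }
    }

    fn log_static(&mut self, msg: Message) {
        // Store a copy for late joiners, then propagate as usual.
        self.latched = Some(msg.clone());
        self.log(msg);
    }

    fn add_sink(&mut self, mut sink: Box<dyn Sink>) {
        // A new subscriber receives the static message first.
        if let Some(msg) = &self.latched {
            sink.receive(msg);
        }
        self.sinks.push(sink);
    }
}
```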
I'm open to suggestions for nomenclature. I think either "static" or "latched" sound reasonable, but maybe someone has an even better idea.
I think the limit of one-per-channel makes sense, and is clearly documented, but it's not obvious from the name (`log_static()`, cf. `log()`).
Naming is the hardest thing. I kind of like Dan's "sticky" suggestion. Or maybe "pinned", along the same lines. How do we feel about `log_sticky()` or `log_pinned()`?
I do think "static" makes sense. "latched" (at least, as I understand it from ROS) would be a property of the channel: the last message sent on the channel is automatically delivered to new subscribers.
Got it. Yeah, we should avoid overlapping with established ROS terminology if we aren't implementing it precisely.
That does raise the question of whether it's useful to have both static and non-static messages on a single channel, which seems to be the difference here. If not, then we could adopt 'latched channels' and stick with the same `log()` function. But that does raise problems for the high-level logging API, which avoids channel configuration entirely. That would need `log_latched` or something.
That's a fair question. I think @amacneil has some use cases in mind for static messages, so maybe he has an opinion on this. I could go either way.
I don't think this is necessarily a problem for the high-level API, especially if we make it possible to declare a channel (with explicit configuration) and then subsequently use it from the high-level API.
I imagine the Python API here will stick with a single `log()` method regardless, and add that as a kwarg. I'd like to adopt the same feature for Python in the same SDK release; we can do a separate PR, but maybe ensure they're part of the same release?
Agreed, I'll get one queued up.
Bottom line on top: I think we shouldn't have this option, I think we should always re-send the last logged message for each channel to every new websocket connection.
ROS puts all robot functionality into dynamically restarting nodes that communicate over pub/sub. ROS can't rely on nodes retaining state over long periods of time, they need to be able to recover after restarting, and they need to collect enough state to recover over pub/sub. Latching topics (or channels) are ROS's solution to this problem.
Critically, ROS cannot have all channels use latching behavior because most robotics code assumes that every message it receives on pub/sub was published recently; it does not check the timestamp for staleness. For this reason, latching topics are only used for data that is static or almost-static.
However, for the Foxglove SDK, we don't have this constraint. The only consumer of live data (the websocket client, ie. the Foxglove App) knows about stale data and can place messages on a timeline accurately. Therefore, we can stick with one behavior for all channels, which I think should be to send the last logged message to every new websocket connection, no matter how old it is.
There are some considerations here around logging and log rotation. I believe that `rosbag record` will duplicate the last message from every latched channel into the next log file when rotating. This means that every log file contains enough information to reconstruct the robot state for its logged period, at the cost of some duplication. However, any software that merges these log files (the Foxglove backend is one) will need to be aware of this and de-duplicate those messages. There's no existing software that handles this well, including Foxglove and `mcap merge`.
IMO the initial policy of the Foxglove SDK should be to never duplicate messages from one MCAP file to the next. The upside of this is that we never see duplicated messages when streaming data in Foxglove. The downside is that if a client has log rotation enabled, they will not see their slow-logged messages in rotated files. When streaming from Foxglove, the `replayLookbackPolicy` will not help them either, since it only looks back to the start of files overlapping the requested time range.
Later, we should update the MCAP spec to be aware of channels where the first message was copied in from before when recording started. This would allow us to handle duplicated messages intelligently on merge.
> I think we should always re-send the last logged message for each channel to every new websocket connection.
I generally like this idea. And good points about the log rotation.
I think we'll need to update the handling here to preserve the original log time.
It seems possible users might be surprised in some cases. Perhaps connecting to a device that has been online for a while and that logged some event an hour ago; some panel shows that event in a way that it's mistaken for current data. That's pretty different from (say) part of a Scene which a user did intend to be static.
> Bottom line on top: I think we shouldn't have this option, I think we should always re-send the last logged message for each channel to every new websocket connection.
>
> ROS puts all robot functionality into dynamically restarting nodes that communicate over pub/sub. ROS can't rely on nodes retaining state over long periods of time, they need to be able to recover after restarting, and they need to collect enough state to recover over pub/sub. Latching topics (or channels) are ROS's solution to this problem.
>
> Critically, ROS cannot have all channels use latching behavior because most robotics code assumes that every message it receives on pub/sub was published recently; it does not check the timestamp for staleness. For this reason, latching topics are only used for data that is static or almost-static.
>
> However, for the Foxglove SDK, we don't have this constraint. The only consumer of live data (the websocket client, ie. the Foxglove App) knows about stale data and can place messages on a timeline accurately. Therefore, we can stick with one behavior for all channels, which I think should be to send the last logged message to every new websocket connection, no matter how old it is.
This makes sense. I like the idea.
We could save the last message on each channel and resend it when `add_channel` is called on the websocket client sink.
How would we handle a case where no sinks are attached to the channel? Would we always have to serialize and save the message just so we could keep it on the channel in case a websocket client is later added? That defeats the whole purpose of `has_sinks()` and trying to avoid paying the serialization cost when there is no sink. Maybe that's the right thing to do unless the user somehow indicates they will not create a websocket server?
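One way to frame that tradeoff as code (sketch only; `Encode`, `on_client_subscribe`, and the field names are hypothetical, not the SDK API): if the last message must be replayable to a late-joining websocket client, the channel has to serialize and retain it even while no sink is attached.

```rust
// Illustrative sketch of the serialization tradeoff; none of these
// names are the real SDK API.
trait Encode {
    fn encode(&self) -> Vec<u8>;
}

struct Channel {
    last_serialized: Option<Vec<u8>>, // retained even with zero sinks
    sink_count: usize,
}

impl Channel {
    fn log<T: Encode>(&mut self, msg: &T) {
        // Without replay, serialization could be skipped entirely when
        // sink_count == 0. With replay, the bytes must exist in case a
        // websocket client attaches later, so we always pay this cost.
        self.last_serialized = Some(msg.encode());
        // ...fan out to attached sinks here...
    }

    fn on_client_subscribe(&self) -> Option<&[u8]> {
        // Replayed to the new subscriber before any live messages.
        self.last_serialized.as_deref()
    }
}
```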
> There are some considerations here around logging and log rotation. I believe that `rosbag record` will duplicate the last message from every latched channel into the next log file when rotating. This means that every log file contains enough information to reconstruct the robot state for its logged period, at the cost of some duplication. However, any software that merges these log files (the Foxglove backend is one) will need to be aware of this and de-duplicate those messages. There's no existing software that handles this well, including Foxglove and `mcap merge`.
>
> IMO the initial policy of the Foxglove SDK should be to never duplicate messages from one MCAP file to the next. The upside of this is that we never see duplicated messages when streaming data in Foxglove. The downside is that if a client has log rotation enabled, they will not see their slow-logged messages in rotated files. When streaming from Foxglove, the `replayLookbackPolicy` will not help them either, since it only looks back to the start of files overlapping the requested time range.
>
> Later, we should update the MCAP spec to be aware of channels where the first message was copied in from before when recording started. This would allow us to handle duplicated messages intelligently on merge.
I think the latched message should only apply to the websocket server, and not be used at all by mcap logging. If you attach a new mcap file, it won't re-log the last logged message. I'm not sure if that's what you're getting at, I'm confused how we got from live viz to mcap.
I'm generally in favor of simplifying user interfaces, but I'm not yet convinced that this is a win in that respect.
Pros:
- Fewer knobs for the user to get confused about
Cons:
- More complicated logging semantics to explain to the user, especially if ws & mcap behave differently
- Runtime overheads:
  - Need to allocate heap buffers for every message; can't use the stack for small messages
  - Need to serialize all messages, even if there are no sinks subscribed
  - No benefit for messages that are logged at high frequency
- Possible surprises involving stale data that the user never wanted to latch, and now can't avoid latching
I disagree about treating mcap files differently with respect to message latching. The Firefly Automatix folks talked about this specifically: they said they wanted each recording file to be self-contained, which means including latched data. They said that with the way they rotate (or split) files today, that isn't the case, and it's frustrating. Personally, I'd rather have a recording file, as an export format, be self-contained. How we store timeseries data in the cloud is a separate matter.
Consider the following workflow. Suppose I'm using the live viz, and I notice an interesting behavior that I want to record and upload. So I cue up the robot, click a button to start recording an mcap file locally on the device, have the robot do its thing, and then upload the mcap file to foxglove. If the contents of that recording differ from the contents displayed in the live viz - e.g., by omitting slow-updating/static data - that's a bug from the user's perspective.
Why is it hard to deduplicate messages when merging recordings? We have timestamps and sequence numbers, which should make it possible, though I suppose it might get a little more difficult if you process recording files out of order.
OK, I see the points and am convinced.
I don't think deduplicating messages by sequence number + channel content + timestamp will work across all MCAP files in existence, but we could certainly establish a convention for MCAPs that are produced by the SDK, and follow that convention within Foxglove. We should choose a unique profile string for SDK-produced MCAPs if we haven't already.
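As a sketch of what such a convention could look like (hypothetical; not part of MCAP or the SDK today), a merge step could treat a message as a duplicate when its channel, sequence number, and log time have all been seen before:

```rust
use std::collections::HashSet;

// Hypothetical dedup key for SDK-produced MCAPs; not an existing
// MCAP or Foxglove feature.
#[derive(Clone, PartialEq, Eq, Hash)]
struct MessageKey {
    channel_id: u16,
    sequence: u32,
    log_time_ns: u64,
}

struct Deduplicator {
    seen: HashSet<MessageKey>,
}

impl Deduplicator {
    fn new() -> Self {
        Deduplicator { seen: HashSet::new() }
    }

    /// Returns true on the first occurrence of a key, false for repeats.
    fn keep(&mut self, key: MessageKey) -> bool {
        self.seen.insert(key)
    }
}
```

This only works within the convention: as the comment above notes, two distinct messages may legitimately share a timestamp, so the sequence number has to be reliable for the key to be safe.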
Some of my thoughts and context on static/latching
tl;dr: long-lived static data is always a mistake; it trades a slight convenience at process-restart/runtime for hard-to-work-with data downstream when trying to debug a problem. In every single case where someone was doing long-lived latching (many minutes/hours), their debugging/observability life could have been made much easier had they published their data at some periodic rate.
I believe it is important to avoid conflating latching with static data. The two are not the same, though they might often appear in similar contexts. From what I have experienced, latching is the idea that a new subscriber to a topic receives the latest message on that topic, without waiting for the producer to create a new message on that topic. This is distinct from static information, which "in theory" does not change for the lifetime of a robot. Latching tends to be a setting on the publisher side because most of these pub/sub middlewares don't tell you when a subscriber is present, so you defer this activity to the middleware layer. It's also seen as a publisher responsibility to know whether the data "is latchable" (i.e. not considered stale after some time).
Note: ROS1 had a latching concept that was removed in ROS2 in favor of a history size where you could even replay some number of messages when a node re-connected.
Static topics tend to use "latching" as an option for the publisher but have additional semantics on top of publisher latching. They are topics that one does not expect to change frequently (or ever) for the lifetime of some process. The typical example from ROS is the `tf_static` topic, which gets special semantics in ROS transform handling: transforms on that topic never time out (ROS otherwise times out regular transforms after 10 seconds). But having this special handling of a topic introduces its own gotcha: systems have to agree that they will treat this topic specially. This is possible/viable with channel metadata if we want to explore those approaches. The notable thing here is that this example is specific to ROS.
Video data is another interesting example. To create an image frame from video data, you need to start with a keyframe and apply all the delta frames up to your timestamp. If you rotate an mcap file between two keyframes, you won't be able to view that mcap file on its own without reaching back to the previous mcap file. Practically, it can be quite hard to attain the "mcap file should stand on its own" idea, which is why I push folks to think of mcap logging as a sequence of files and not get hung up on single-file workflows, though they are convenient.
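A toy illustration of that decode dependency (not real video decoding; `Frame` and `decode_at` are invented for this sketch): materializing a frame requires walking back to the most recent keyframe and applying every delta after it, so a file that starts between two keyframes cannot be decoded on its own.

```rust
// Toy model: a keyframe carries full state; a delta is applied
// element-wise on top of the previously decoded state.
#[derive(Clone)]
enum Frame {
    Key(Vec<i64>),
    Delta(Vec<i64>),
}

fn decode_at(frames: &[Frame], idx: usize) -> Option<Vec<i64>> {
    // Find the latest keyframe at or before idx; without one (e.g. the
    // recording was split mid-sequence), the frame cannot be rebuilt.
    let start = frames[..=idx].iter().rposition(|f| matches!(f, Frame::Key(_)))?;
    let mut state = match &frames[start] {
        Frame::Key(k) => k.clone(),
        Frame::Delta(_) => unreachable!(),
    };
    // Apply each delta forward up to the requested frame.
    for f in &frames[start + 1..=idx] {
        if let Frame::Delta(d) = f {
            for (s, d) in state.iter_mut().zip(d) {
                *s += d;
            }
        }
    }
    Some(state)
}
```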
Ultimately the ideas of static data, latching, video, and rolling mcap files have gotchas and tradeoffs for runtime and (IMO) the equally important observability time. It can be easy to shrug off the downstream observability impacts of decisions like "static" data for someone who only cares about their small part of the world or about runtime, but when thinking about the entire lifecycle of debugging, it makes every step that much more complex.
> Consider the following workflow. Suppose I'm using the live viz, and I notice an interesting behavior that I want to record and upload. So I cue up the robot, click a button to start recording an mcap file locally on the device, have the robot do its thing, and then upload the mcap file to foxglove. If the contents of that recording differ from the contents displayed in the live viz - e.g., by omitting slow-updating/static data - that's a bug from the user's perspective.
Not to dissuade us from trying to find solutions but this generally is a hard problem. What if your robot has scene entity state? The best advice I can give to such workflows is to reduce state and to ensure your data publishes at some predictable rates.
> Why is it hard to deduplicate messages when merging recordings? We have timestamps and sequence numbers, which should make it possible, though I suppose it might get a little more difficult if you process recording files out of order.
It is valid to have two messages at the same timestamp - how do you know they are duplicates? Sequence numbers would be more viable but not everyone uses those.
We're going to close this for now, for lack of demand.