Feature Request: Container Metrics Charts in Stats Tab
Overview
I would like to contribute to implementing real-time charts for container metrics (CPU, memory, network) within the Stats tab to enhance monitoring capabilities.
Technical Considerations
Performance Impact
Streaming metrics will introduce some overhead when the data comes from the agent side, since it would have to be streamed to the master first.
WebSocket Strategy
I'm thinking about two approaches for metrics data transmission:
Option 1: Dedicated Metrics WebSocket
- Separate WebSocket connection for metrics streaming
- Independent connection management and error handling
Option 2: Single WebSocket
- Utilize the existing WebSocket connection for all communication
- Simpler connection management
I'm leaning toward Option 1 for separation of concerns, performance optimization, and architectural clarity.
I'm ready to begin implementation once we align the approach.
Hi again Alan,
Thanks a lot for opening the issue, and for sharing your reasoning with me.
To move forward, I would suggest the following:
- Define a list of metrics to gather from the containers. (What is "CPU / Memory / Network" exactly?)
- Define how those metrics will be gathered from the containers. (Is there a native Docker API for that?)
- Determine whether these metrics will always be available, or may be unavailable for certain images. (Thinking about barebones / scratch / blank images.) If those metrics are unavailable in certain cases, define what we should do (like, show an empty chart or a simple message to instruct the user about what's missing?)
As for the transmission architecture with the websockets, I am personally more inclined to keep everything on the already-existing single websocket channel (that would be your Option 2). That would enable us to reuse everything that's already working in the code (authentication, message chunking for large payloads, single node + multi-host + multi-node communication, etc.).
Regarding that point still, am I wrong to think that there wouldn't be any overhead at all?
I imagine the following scenario:
- A new tab is added on the container panel on the right: the "Stats" tab.
- By default, nothing from that tab would be loaded unless we navigate to it (as is the case for all tabs in general).
- When the user navigates to that tab, the browser periodically (every 500 ms, for example) sends the same command, maybe "container.stats" with the container ID as a parameter, and the server fetches the latest stats from the container identified by that ID.
- Everything is then sent from the server to the browser over the current websocket, in the form of arrays of numbers (or any format that makes sense), such as: CPU: [1%, 5%, 3%, 8%], Memory: [300 MB, 127 MB, 823 MB], and so on.
Building on top of that, I also imagine that a few optimizations could be made:
- Maybe the server could, on its own, fetch the containers' metrics at a set interval, so the metrics would already be known even before they're queried by the browser.
- Maybe the browser could store the metrics offline (localStorage or other) as they're received, so the server only has to send one data point on each call, rather than a full time series with past data.
Sorry, that's a lot of maybes and questions, but I feel that, if we choose to work on that feature, we'd better prepare ourselves very well.
Let me know what you think, and thanks a lot once again.
Hi Will,
Thank you for the detailed feedback and questions! I appreciate you taking the time to think through the implementation details.
Let's address the points you mentioned:
Regarding metrics definition:
Looking at the Docker API, I think we can start with CPU and memory percentage as the core metrics, then add support for network I/O and block I/O later. If we can't retrieve those metrics for certain containers, we should render an empty plot and not stream anything.
Regarding websockets:
I completely agree on reusing the existing websocket channel. You're right that we should just reuse the existing infrastructure: authentication, message chunking, and so on.
Regarding overhead:
I was thinking there might be some overhead since we're not directly connected to the remote Docker socket (in the multi-node case, the stream goes from Docker to the agent node, then to the master node), but in practice this might be negligible.
Implementation approach:
Your proposed architecture makes a lot of sense, but I was thinking that we would have to somehow keep collecting metrics even after the user leaves that particular container's Stats tab, both to keep the plot smooth and so we still have information about what happened while the chart wasn't being watched.
Proposed flow:
We would use the container stats API and read from the stream it provides.
The backend would spawn a goroutine to process container metrics from that stats stream, with another goroutine in the handler ticking every second to send metrics via websocket.
When the client initiates the metrics polling command, the master starts streaming from either the agent or the Docker socket directly. When the user leaves the tab (opens another container), the client sends a stop command: we stop the goroutine sending metrics for that container, but keep the collecting goroutine working in the background, still updating the metrics array. When the user comes back to this container, the client sends a resume command: the backend sends the accumulated metrics array and the client re-renders the plot.
Memory Management:
We should probably store metrics in a circular buffer to prevent memory overflows from long-running containers, or situations where users leave tabs open for extended periods.
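A circular buffer for the metric samples could be sketched like this. The capacity and the `float64` sample type are placeholders, not a final design:

```go
package main

import "fmt"

// Ring is a fixed-capacity circular buffer of metric samples.
// Once full, the oldest sample is overwritten by the newest,
// so memory stays bounded for long-running containers.
type Ring struct {
	data  []float64
	start int // index of the oldest sample
	count int
}

func NewRing(capacity int) *Ring {
	return &Ring{data: make([]float64, capacity)}
}

// Push appends a sample, evicting the oldest one when full.
func (r *Ring) Push(v float64) {
	if r.count < len(r.data) {
		r.data[(r.start+r.count)%len(r.data)] = v
		r.count++
		return
	}
	r.data[r.start] = v
	r.start = (r.start + 1) % len(r.data)
}

// Values returns the samples in order, oldest first.
func (r *Ring) Values() []float64 {
	out := make([]float64, 0, r.count)
	for i := 0; i < r.count; i++ {
		out = append(out, r.data[(r.start+i)%len(r.data)])
	}
	return out
}

func main() {
	r := NewRing(3)
	for _, v := range []float64{1, 2, 3, 4} {
		r.Push(v)
	}
	fmt.Println(r.Values()) // prints [2 3 4]: the oldest sample was evicted
}
```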
What do you think about this approach?
Hi Will,
Just wanted to follow up on this feature request. Regarding the container metrics charts, what's the next step in our implementation? Should I go ahead and write a more detailed design, or develop a small proof of concept? Thanks!
Hi again Alan,
Once again, thank you very much, really.
I think we've almost reached a stage where we can start the implementation and see the light with that feature.
Before jumping on the implementation, I would just like to explore a few potential caveats and scenarios:
- I definitely agree with you on the idea of collecting the metrics in the background in order to later show a chart with historical data points. However, in practice, how would that work for someone who has, let's say, 20 or 100 containers? The server would be constantly polling the Docker API for those 20 or 100 containers in parallel, right? I don't know if that's an issue at all (I think it isn't, because we're just throwing N requests at the Docker API), but I wonder whether, upon our request, Docker runs processes that are heavy on the CPU or memory. I'm afraid that adding this feature as a whole could make the app heavy on the server side because of that polling.
- I would like to suggest a variation of your flow for exchanging metrics between the client and the server. During development, I have noticed that, when the server expects the client to send a signal to release a resource, it often leads to issues where the server keeps the unused resource in memory, and it accumulates. In my experience, that issue may come from any of: an unstable connection, closing the browser, closing the tab in the browser, or the browser automatically putting an inactive tab to sleep. In all those cases, the client fails to reliably send a "stop" message to release the resource, and over time, orphan resources may hog the memory, or the server keeps sending messages to a client whose code doesn't expect to receive anything. That is just one thing. The other is that, in terms of code, if we wish to emit a "stop" signal from the client when the user leaves the "Stats" tab, then we must insert our logic in every possible scenario where the user could leave the tab: clicking in another panel, clicking on another row, opening a menu, pressing a keyboard shortcut, opening a terminal, etc. With the first problem, and this one on top of it, I think we should find a flow where the client doesn't need to send a "stop" message at all.
- Building on the previous point, here's the flow that I think could work better, and be easier to implement:
  - The user navigates to the "Stats" tab of the container.
  - The browser, over WebSocket, sends a first command to the server: "container.stats" with the ID of the container.
  - That command embeds another parameter, "from".
  - That parameter determines the latest data point known by the client. Here, for a first load, it equals 0.
  - The server receives that command, and sends all the historical data points accumulated.
  - The client receives the historical data points, and keeps sending "container.stats" with the "from" parameter.
  - Here, "from" is set to the latest known point received from the server (a timestamp).
  - The polling command is sent by the client every N seconds or milliseconds using the JS `setInterval` call.
  - The `setInterval` callback could check, as its first instruction, whether the current tab is "Stats".
  - If not, then it could clear the interval and prevent further calls.
  - Every time the server receives a new "container.stats" command, it just sends the latest points it has fetched in the background.
- The reasons why I prefer that flow are the following, but please feel free to correct me and point out anything that's wrong with my approach:
  - No need for a goroutine to stream data from the server to the client -> only simple one-off commands sent at an interval.
  - No need to rely on a "stop" command from the client to close a stream or release a resource.
  - No need for different commands to handle the metrics communication -> only one simple "stats" command with a "from" parameter.
- Finally, following up on the very first point, we should decide how the server will handle the collection of metrics from N containers in the background, completely autonomously and without user intervention. For that, I searched a bit, and I think I found the gocron package. We could have a dedicated job running at a fixed interval; that job calls the ContainerStatsOneShot method from the Docker API and fills an in-memory / in-file circular buffer. Would that work?
Thank you once again a lot for your time and work, and looking forward to getting your feedback on those points.
Hi Will,
Thank you so much for taking the time to thoroughly analyze the implementation details and potential edge cases. I really appreciate your efforts 🙏
Performance Considerations
Regarding your concerns about CPU usage with multiple containers, I think polling every 3 seconds strikes a good balance. Here's what we might expect:
- 4-core system: ~5-15% CPU usage with 100 containers
- 8-core system: ~2-8% CPU usage
- 16-core system: ~1-4% CPU usage
If we encounter performance issues in practice, we could implement a worker cap (say, 50 max concurrent polls) and stop the least frequently accessed workers.
Client-Server Communication Flow
I had missed the reliability issues you mentioned (unstable connections, tab closures, browser sleep modes); those would indeed be a problem here. I really like that this solution eliminates the need for "stop" signals, and the overall simplification of the communication.
The "from" parameter is also nice: it allows both initial data loading and further updates through a unified interface.
Background Data Collection
For autonomous metrics collection, I like your gocron suggestion, though I wonder if a simple ticker with goroutines might be sufficient for our needs:
```go
func pollStats(ctx context.Context, container *Container) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Stop polling containers that haven't been accessed recently.
			if time.Since(lastAccessed[container.ID]) > 30*time.Minute {
				return
			}
			// ContainerStatsOneShot takes just the context and container ID.
			stats, err := client.ContainerStatsOneShot(ctx, container.ID)
			if err != nil {
				continue // container may have stopped; retry on the next tick
			}
			// Decode stats.Body and push the sample into the metrics buffer here.
			stats.Body.Close()
		case <-ctx.Done():
			return
		}
	}
}
```
This approach would automatically stop polling containers that haven't been accessed recently, helping with resource management.
With these details sorted out, I think I can start implementing this feature now.
Looking forward to your thoughts on these implementation details!
Hi again Alan,
Thank you very, very, very much.
I love your approach with the Ticker in place of the gocron library, and I didn't even know that feature existed. I'm on the same page with you, and I'm happy with that solution as it embraces a native-first code, and removes the need for an extra dependency.
For the polling strategy based on access frequency with your code excerpt, I'm definitely on board with you as well. It would require to keep a map of Container ID <-> Last Access Timestamp in memory on the server, and update that map on every container.* command received.
Maybe one very last thing: how are we drawing the charts on the frontend? I'm wondering whether we should go with an ASCII charting library and just output the characters in the HTML, or use a "regular" charting library (chartjs, apexcharts, etc.) and style it to mimic a terminal UI.
In any case, you can definitely jump on the train and start working on that feature if you feel that you can do it, and if you have time of course. Otherwise, I'm not against doing it myself since I'm familiar with the project's code base. We can also work and talk together, and see if we can share the workload.
You tell me, and I'm happy to see that feature come alive as we exchange.
Thanks a lot once again
Hi Will,
I couldn't find an ASCII charting library, but we could consider using a lightweight JavaScript library instead. uPlot seems like a good option—it's minimal, doesn't include built-in animations or unnecessary features, and I believe it offers enough flexibility to style it like a terminal UI.
I'm definitely eager to jump into the implementation. I have the time and would love to take this on. I'll follow the strategy we discussed and open a draft PR soon so we can keep the conversation going as I make progress.
Thanks again for your time and support, I really appreciate it!
Alright Alan, you rock.
Thank you very much, and I look forward to what's coming next!
PS : Perfect for uPlot as well. The lighter, the better.