balena-supervisor
balenaCloud CPU Usage metric not updating as expected
balenaCloud tracks CPU usage for a device. This value appears on the dashboard as the CPU indicator. Some customers and I have noticed that the CPU value seems to get "stuck", often at a high level for an hour or more. For example, see this forum topic.
Below are some observations about metrics collection in general, but they apply to the specific problem of CPU usage. In summary, I think the feature is working as written. There are ways to make it more responsive without adding load to the system.
Background
The values in balenaCloud are the result of the Supervisor sending data from the device and the balena API accepting the data at the server. The server accepts a new system metric report approximately every 200 seconds (3 minutes, 20 seconds). In other words, if the server receives a changed metric for a device less than 200 seconds after the previous report, it discards that metric. This change is relatively recent, and may explain why we are hearing of this issue now.
Also, the server accepts any provided metric values and does not expect all metric values at once. If a value is not provided, the existing value is not changed.
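For illustration, here is a hypothetical sketch of the acceptance rule described above. It is not the actual balena API code, and the names are made up; the key point it shows is that the throttle works per device rather than per metric, so accepting one metric blocks all other metrics for roughly 200 seconds.

```typescript
// Hypothetical sketch of the server-side acceptance rule, not the balena API
// implementation. Throttling is keyed by device, so accepting one metric
// blocks every other metric for that device for ~200 seconds.
const METRIC_THROTTLE_MS = 200 * 1000;
const lastAcceptedAt = new Map<string, number>(); // keyed by device UUID

function acceptMetricsReport(
	uuid: string,
	metrics: Partial<Record<string, number>>,
	now: number = Date.now(),
): boolean {
	const last = lastAcceptedAt.get(uuid) ?? 0;
	if (now - last < METRIC_THROTTLE_MS) {
		// Too soon since the previous accepted report for this device; discard.
		return false;
	}
	lastAcceptedAt.set(uuid, now);
	// Only the provided values are updated; missing metrics keep their old
	// values. (Persistence is omitted in this sketch.)
	return true;
}
```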
The Supervisor takes metric readings every 10 seconds, and uses buckets to determine when to send to the server. The buckets are:
- CPU usage -- 0, 20, 40, ... 100 %
- CPU temperature -- 0, 5, 10, ... C
- Memory usage -- 0, 10, 20, ... MB
The use of buckets is significant. It means that the Supervisor considers 99 MB to be in the 90 MB bucket, but 100 MB is in the 100 MB bucket.
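As a rough illustration, here is a minimal sketch of the bucket comparison described above. It is not the Supervisor's exact code, but the bucket sizes match the list.

```typescript
// Bucket sizes from the list above: CPU % in 20s, temperature in 5s, memory in 10s
const bucketSize = {
	cpu_usage: 20, // percent
	cpu_temp: 5, // degrees C
	memory_usage: 10, // MB
} as const;

type Metric = keyof typeof bucketSize;

// Round a value down to the bottom of its bucket
function bucket(name: Metric, value: number): number {
	return Math.floor(value / bucketSize[name]) * bucketSize[name];
}

function isSignificant(name: Metric, previous: number, current: number): boolean {
	// Significant only when the values land in different buckets, so 99 MB
	// (bucket 90) to 100 MB (bucket 100) is significant, while 311 MB to
	// 319 MB (both bucket 310) is not.
	return bucket(name, previous) !== bucket(name, current);
}
```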
Observations
- There is inherent jitter in the cpu_temp and memory_usage readings. See examples below.
- The isSignificant calculation uses absolute bucket boundaries as described above, so memory jittering on a boundary, like 319 - 320 MB, generates significance. See examples below.
- The Supervisor only publishes the metric that changed. So if only CPU temperature changes significantly, it does not send CPU usage, memory usage, or storage usage. Anecdotally, I found that adding load resulted in a significant CPU temperature change on a metrics collection iteration preceding the iteration with a significant CPU usage change.
- cpu_usage and cpu_temp are linked/dependent variables. Since cpu_temp is jittery, its changes tend to dominate cpu_usage.
- The intervals at which measurement changes are accepted differ significantly between the server and the Supervisor: 200 seconds vs. 10 seconds. This leads to confusion for users. The Supervisor compares readings from as recently as 10 seconds ago, but the user sees a previous value from possibly many minutes earlier.
Examples
All of these examples were collected on an Intel NUC running balenaOS 2.89.15 and Supervisor 12.11.42.
I added logging when the Supervisor determines whether a metric has changed buckets (isSignificant). Notice especially the jitter in memory_usage at bucket boundaries.
Mar 20 17:50:38 [info] memory_usage isSignificant: 318 to 327
Mar 20 17:50:38 [info] cpu_temp isSignificant: 32 to 42
Mar 20 17:50:38 [info] Reported current state to the cloud
Mar 20 17:50:48 [info] memory_usage isSignificant: 327 to 318
Mar 20 17:50:48 [info] cpu_temp isSignificant: 42 to 30
Mar 20 17:50:48 [info] Reported current state to the cloud
Mar 20 17:51:18 [info] memory_usage isSignificant: 318 to 320
Mar 20 17:51:18 [info] Reported current state to the cloud
Mar 20 17:51:38 [info] memory_usage isSignificant: 320 to 318
Mar 20 17:51:38 [info] Reported current state to the cloud
Mar 20 17:51:48 [info] cpu_temp isSignificant: 30 to 35
Mar 20 17:51:49 [info] Reported current state to the cloud
Mar 20 17:51:58 [info] cpu_temp isSignificant: 35 to 31
Mar 20 17:51:58 [info] Reported current state to the cloud
Mar 20 17:52:48 [info] memory_usage isSignificant: 318 to 320
Mar 20 17:52:48 [info] Reported current state to the cloud
Mar 20 17:53:08 [debug] Attempting container log timestamp flush...
Mar 20 17:53:08 [debug] Container log timestamp flush complete
Mar 20 17:53:09 [api] GET /v1/healthy 200 - 2.180 ms
Mar 20 17:53:18 [info] cpu_temp isSignificant: 31 to 38
Mar 20 17:53:18 [info] Reported current state to the cloud
Mar 20 17:53:28 [info] cpu_temp isSignificant: 38 to 30
Mar 20 17:53:28 [info] Reported current state to the cloud
Mar 20 17:54:08 [info] memory_usage isSignificant: 320 to 319
Mar 20 17:54:08 [info] Reported current state to the cloud
Mar 20 17:54:18 [info] memory_usage isSignificant: 319 to 321
Mar 20 17:54:18 [info] Reported current state to the cloud
Mar 20 17:56:08 [info] memory_usage isSignificant: 321 to 319
Mar 20 17:56:08 [info] Reported current state to the cloud
Mar 20 17:56:18 [info] memory_usage isSignificant: 319 to 320
Mar 20 17:56:18 [info] Reported current state to the cloud
Mar 20 17:56:28 [info] memory_usage isSignificant: 320 to 317
Mar 20 17:56:28 [info] Reported current state to the cloud
Mar 20 17:57:58 [info] cpu_temp isSignificant: 30 to 35
Mar 20 17:57:58 [info] Reported current state to the cloud
Mar 20 17:58:08 [info] cpu_temp isSignificant: 35 to 31
Mar 20 17:58:08 [info] Reported current state to the cloud
Mar 20 17:58:09 [api] GET /v1/healthy 200 - 1.739 ms
Mar 20 18:00:08 [info] memory_usage isSignificant: 317 to 320
Mar 20 18:00:08 [info] Reported current state to the cloud
I also collected CPU temperature readings every 2.5 seconds in a separate container, using the same systeminformation package as the Supervisor (a minimal collection sketch follows these readings). Notice the inherent jitter in the temperature readings, including the 14:27:50 reading outside the 30-34 bucket.
Mar 19 14:27:00 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:03 {'CPU': 0, 'Temp': 31}
Mar 19 14:27:05 {'CPU': 1, 'Temp': 32}
Mar 19 14:27:08 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:10 {'CPU': 0, 'Temp': 34}
Mar 19 14:27:13 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:15 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:18 {'CPU': 0, 'Temp': 30}
Mar 19 14:27:20 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:23 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:25 {'CPU': 1, 'Temp': 32}
Mar 19 14:27:28 {'CPU': 0, 'Temp': 33}
Mar 19 14:27:30 {'CPU': 0, 'Temp': 34}
Mar 19 14:27:33 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:35 {'CPU': 1, 'Temp': 32}
Mar 19 14:27:38 {'CPU': 0, 'Temp': 34}
Mar 19 14:27:40 {'CPU': 0, 'Temp': 33}
Mar 19 14:27:43 {'CPU': 0, 'Temp': 31}
Mar 19 14:27:45 {'CPU': 0, 'Temp': 32}
Mar 19 14:27:48 {'CPU': 1, 'Temp': 33}
Mar 19 14:27:50 {'CPU': 0, 'Temp': 37}
Mar 19 14:27:53 {'CPU': 0, 'Temp': 34}
Mar 19 14:27:55 {'CPU': 1, 'Temp': 32}
Mar 19 14:27:58 {'CPU': 0, 'Temp': 32}
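For reference, a minimal sketch of one way to collect similar readings with the systeminformation package is below. It assumes systeminformation v5 field names (currentLoad, main) and is an illustration rather than the exact script used above.

```typescript
// Poll CPU load and temperature every 2.5 seconds using systeminformation
// (field names per systeminformation v5).
import * as si from 'systeminformation';

async function sample(): Promise<void> {
	const [load, temp] = await Promise.all([
		si.currentLoad(),
		si.cpuTemperature(),
	]);
	console.log(
		new Date().toISOString(),
		JSON.stringify({ CPU: Math.round(load.currentLoad), Temp: Math.round(temp.main) }),
	);
}

setInterval(sample, 2500);
```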
Suggestions
1. If one reading changes, send all four of the readings even if they are not significantly different. This would help CPU usage, given its linkage to temperature. I would not expect this change to cause a significant issue on the server.
2. It might be better to use relative change in value rather than absolute bucket boundaries to determine if a change is significant. For example, if the last published memory usage reading was 324, then a significant reading must be < 314 or > 334. This could reduce network traffic to the server that would be rejected anyway due to frequency. Alternatively, use a hysteresis check like "must cross a bucket boundary twice in a row to be considered changed to the new bucket". (A sketch of the relative-change approach follows this list.)
3. Maybe increase the bucket size for cpu_temp, due to the inherent jitter observed in temperature readings.
4. [2022-06-03] Allow for some communication mechanism with the server so the device knows the most recent metric values retained by the server. It can then use these values to determine when a change in a metric reading is significant and should be sent.
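As a rough sketch of Suggestion 2, a relative-change check using the same thresholds as the current bucket sizes might look like the following. Names such as isSignificantRelative and markReported are hypothetical, not Supervisor code.

```typescript
type MetricName = 'cpu_usage' | 'cpu_temp' | 'memory_usage';

// Minimum change, relative to the last *reported* value, to count as significant
const thresholds: Record<MetricName, number> = {
	cpu_usage: 20, // percentage points
	cpu_temp: 5, // degrees C
	memory_usage: 10, // MB
};

const lastReported: Partial<Record<MetricName, number>> = {};

function isSignificantRelative(name: MetricName, value: number): boolean {
	const previous = lastReported[name];
	if (previous === undefined) {
		return true; // first reading is always reported
	}
	// A value jittering around a bucket boundary (e.g. 319 <-> 320 MB) no
	// longer flips significance; it must move a full threshold away from the
	// last reported value.
	return Math.abs(value - previous) >= thresholds[name];
}

// Call after a successful report so future comparisons use the sent value
function markReported(name: MetricName, value: number): void {
	lastReported[name] = value;
}
```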
[kb2ma] This issue has attached support thread https://jel.ly.fish/a169355f-dfc1-4b01-b456-9145bd8ab0e2
> send all four of the readings even if they are not significantly different

This was done to save bandwidth since the device thinks the other values have not changed enough to consider them "changed" values, so the reporting only sends what is different.
I love the other suggestions regarding the bucket policies. I have had some suspicions that they should be re-evaluated, so this is a good nudge for that.
> This was done to save bandwidth since the device thinks the other values have not changed enough to consider them "changed" values, so the reporting only sends what is different.

The problem, though, is that the server's timeout period is much longer and works at the granularity of the device, not the individual metric. As the example above shows, the server often tosses a changed metric B for a device because it recently accepted metric A for that device. I definitely saw this issue with CPU usage.
Certainly any updates to functionality could be sequenced over time though. Maximize bang for the buck, and once the results are good enough, just keep any other updates in reserve for later.
[chrisys] This issue has attached support thread https://jel.ly.fish/f7236045-00ba-42c6-a633-826a5ed47b47
[fisehara] This issue has attached support thread https://jel.ly.fish/417684d8-459c-48a1-9da7-b0cd43a7149f
[jaomaloy] This issue has attached support thread https://jel.ly.fish/a820d415-bebc-4b11-b22c-30c8176cd3f2
[jmakivic] This has attached support thread https://jel.ly.fish/393ec1de-9131-4fad-acfd-945ed1c550e7
Please see this thread for a proposed solution to this issue.
[cywang117] This has attached support thread https://jel.ly.fish/e27ca749-1b74-4f81-9c00-9967d641cd10
Partially fixed by #2049.
@kb2ma what would you say is still missing to fix this issue? I guess it's related to experimenting with the buckets. Do you think you can create sub-tasks for this issue so we can keep track of progress somehow?
PR #2049 mentioned above was released as Supervisor v14.4.1. It indirectly addressed Suggestion 4 in the issue description, about coordinating communication between device and server. The Supervisor now will not send a current state report containing only metrics to the server sooner than 5 minutes after the previous report. This interval matches the minimum interval that the server expects for a metrics-only report, so the state of metrics on the server now matches the state on the device.
Suggestions 2 and 3 relate to bucket size and boundaries. Unfortunately, it is difficult to optimize bucket size generically because the optimization depends on the metric patterns of the application and hardware. For example, CPU load may normally oscillate from low to high on some period. In addition, given the 5-minute latency in metric readings reaching the server, it may be more useful to show an average over time.
Suggestion 1 points to a generic solution -- send all of the metric readings on some interval, and collect historical readings on the server. This approach allows the user to observe patterns and determine when readings deviate from the expected pattern.
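As a rough sketch of what Suggestion 1 could look like on the device side: reportToCloud is a hypothetical stand-in for the Supervisor's actual reporting path, and the metric calculations are simplified.

```typescript
// Report every metric on a fixed interval rather than only "significant"
// changes, so the server's device-level throttle cannot leave one metric
// stale while accepting another.
import * as si from 'systeminformation';

const REPORT_INTERVAL_MS = 5 * 60 * 1000; // matches the server's metrics window

// Stand-in for the Supervisor's actual state report
async function reportToCloud(metrics: Record<string, number>): Promise<void> {
	console.log(new Date().toISOString(), 'report', metrics);
}

async function reportAllMetrics(): Promise<void> {
	const [load, temp, mem] = await Promise.all([
		si.currentLoad(),
		si.cpuTemperature(),
		si.mem(),
	]);
	await reportToCloud({
		cpu_usage: Math.round(load.currentLoad),
		cpu_temp: Math.round(temp.main),
		memory_usage: Math.round(mem.used / 1024 / 1024), // MB
	});
}

setInterval(reportAllMetrics, REPORT_INTERVAL_MS);
```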