foundationdb
foundationdb copied to clipboard
add in uptime_seconds to machine level fields for `status json`
fixes #4411
Changes:
-
Added the machine levels uptime to the
status jsoncommand. Thus making it possible to determine even through process restarts how long a particular machine has been online.- This was done by adding to
MachineMetricsa field calledUptimeSeconds(which is calculated the same as otherwise in that file), and then being accessed in the status serializer.
- This was done by adding to
Testing Done:
-
Didn't add any unit tests as this isn't actually a new metric, it's just being exposed in status json now. So I didn't feel one was necessary for an update to the schema (there also didn't seem to be any existing tests for the schema that I could find so I felt this was fine).
-
Manually tested, and validated that
status jsonshows the correct machine uptime:
Start a cluster with 3 processes on the same machine, running `status json` all have the same uptime
fdb> status json
{
[...]
"cluster" : {
[...]
"machines" : {
"c7ae2fc5440f252bb97cb5b3f68c959d" : {
[...]
"uptime_seconds" : "30.0003"
}
},
"processes" : {
"23095e2afdc28433b1377e317aa573a3" : {
"address" : "127.0.0.1:4500",
[...]
"uptime_seconds" : 30.000299999999999,
"version" : "7.0.0"
},
"8200471ea029b35e3272efc8699552b1" : {
"address" : "127.0.0.1:4501",
[...]
"uptime_seconds" : 30.000299999999999,
"version" : "7.0.0"
},
"d18370147949674a289b4e14fc665800" : {
"address" : "127.0.0.1:4502",
[...]
"uptime_seconds" : 30.000299999999999,
"version" : "7.0.0"
}
},
[...]
}
After restarting two of the processes: machine, and one process match the original boot time. While the other two process times are new
fdb> status json
{
[...]
"cluster" : {
[...]
"machines" : {
"c7ae2fc5440f252bb97cb5b3f68c959d" : {
"address" : "127.0.0.1",
"contributing_workers" : 3,
[...]
"uptime_seconds" : "355.004"
}
},
[...]
"processes" : {
"0898cc149d02203f24ff1326f147c351" : {
"address" : "127.0.0.1:4502",
[...]
"uptime_seconds" : 45.000599999999999,
"version" : "7.0.0"
},
"23095e2afdc28433b1377e317aa573a3" : {
"address" : "127.0.0.1:4500",
[...]
"uptime_seconds" : 355.00400000000002,
"version" : "7.0.0"
},
"8bbf837b75079c46f3c4cf59c86de04b" : {
"address" : "127.0.0.1:4501",
[...]
"uptime_seconds" : 45.000700000000002,
"version" : "7.0.0"
}
},
[...]
Code-Reviewer Section
The general guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
- [ ] The PR has a description, explaining both the problem and the solution.
- [ ] The description mentions which forms of testing were done and the testing seems reasonable.
- [ ] Every function/class/actor that was touched is reasonably well documented.
This is computing the uptime in exactly the same way that process uptime is being computed, and therefore won't actually get a different result. MachineMetrics will be reporting the exact same uptime as ProcessMetrics. In status I think it's just going to give you an arbitrary (or possibly a deterministic) process uptime from one of the processes on that machine, but it won't necessarily represent the longest process.
One thing that's unclear to me is what this uptime is intended to be measuring. Is it supposed to be the number of seconds for the longest running fdbserver process on the machine? Or is it rather the amount of time that the host itself has been running? Machine metrics has usually been used to report host statistics independent of FDB, so if we are going to be adding something to track the longest process uptime, I would suggest renaming this field to make that very clear. Otherwise, the more natural interpretation in this context would be that it's reporting the host uptime.
If you are trying to report the longest uptime of an fdbserver process, then my suggestion would be to not add this field to MachineMetrics at all and instead take the max of the values in ProcessMetrics when you are creating the status document. Then you could rename the field to something like max_process_uptime_seconds.
If you are trying to get the host uptime, then that could go in MachineMetrics, you would need to add some other mechanism to capture this information based on the OS.
Hey @sfc-gh-ajbeamon , I appreciate the review! This is admittedly probably a fault on my part as I was looking for low hanging fruit issues to start off learning the codebase. I saw: If I understand all code pieces correctly we should already have everything to expose this information which would probably require some small changes., and connected in my head "machineState has a field called uptime, that's probably what was meant". Sorry for the extra work needed to review this incomplete 😅
Looking deeper though my understanding is (though I'll wait for @johscheuer to confirm) is that the goal is to count the time any process has been running on this machine. So neither the host uptime, nor the longest running process. In which case I think some tracking would need to be added to: https://github.com/apple/foundationdb/blob/da41534618a2a1edbf6b0b760635175372a66294/flow/Platform.actor.cpp#L1489 (potentially keeping track of pids, and their runtime?) However I'll wait for a confirmation, and take either one of your two ideas, or think of something if it's the way I interpreted the issue.
Looking deeper though my understanding is (though I'll wait for @johscheuer to confirm) is that the goal is to count the time any process has been running on this machine
If that ends up being the goal, that sounds like a tricky task. Unless you are making this data durable somewhere (e.g. recording it in the database), then no process is going to have any knowledge of what fdbserver processes were running prior to the start of your longest running process. My intuition is that host uptime would be a fairly good replacement for something like that while being much easier to determine.
The idea was to simply report the host uptime. So for Linux that would be probably reading sysinfo and just use uptime.