feat: capture system resources
Overview
Closes https://opentrons.atlassian.net/browse/EXEC-370
Create a service to capture system resource usage.
The service uses the psutil library to query running processes, filters them down to the ones we care about, polls their CPU and memory usage metrics, and stores the results as CSV on the robot.
In future PRs the service will be implemented as a systemd service on the robot. Configuration will be set in the API project by setting environment variables.
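The polling flow described in this overview can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ProcessSnapshot` fields, the helper names (`matches_any`, `take_snapshots`, `append_to_csv`), and the substring-based filter are all assumptions made for the example. Only the `psutil.process_iter` call and the attribute names passed to it come from psutil itself.

```python
import csv
import dataclasses
import time
from pathlib import Path
from typing import Iterable, List


@dataclasses.dataclass(frozen=True)
class ProcessSnapshot:
    """Hypothetical storage shape; the dataclass in this PR may differ."""

    query_time: float       # wall-clock time of the poll (seconds since epoch)
    command: str            # full command line of the process
    running_since: float    # process creation time (seconds since epoch)
    user_cpu_time: float    # cumulative CPU seconds spent in user mode
    system_cpu_time: float  # cumulative CPU seconds spent in kernel mode
    memory_percent: float   # resident memory as a percentage of total RAM


def matches_any(command: str, patterns: Iterable[str]) -> bool:
    """Return True if the command line contains any of the given substrings."""
    return any(pattern in command for pattern in patterns)


def take_snapshots(patterns: Iterable[str]) -> List[ProcessSnapshot]:
    """Poll all running processes once and keep the ones matching a pattern."""
    # Third-party dependency, imported here so the helpers above stay usable
    # on machines where psutil is not installed.
    import psutil

    wanted = list(patterns)
    now = time.time()
    snapshots = []
    attrs = ["cmdline", "create_time", "cpu_times", "memory_percent"]
    for proc in psutil.process_iter(attrs=attrs):
        info = proc.info
        # psutil fills in None for attributes it was not allowed to read.
        if any(value is None for value in info.values()):
            continue
        command = " ".join(info["cmdline"])
        if not matches_any(command, wanted):
            continue
        snapshots.append(
            ProcessSnapshot(
                query_time=now,
                command=command,
                running_since=info["create_time"],
                user_cpu_time=info["cpu_times"].user,
                system_cpu_time=info["cpu_times"].system,
                memory_percent=info["memory_percent"],
            )
        )
    return snapshots


def append_to_csv(path: Path, snapshots: Iterable[ProcessSnapshot]) -> None:
    """Append one CSV row per snapshot, in dataclass field order."""
    with open(path, "a", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerows(dataclasses.astuple(s) for s in snapshots)
```

A polling service would then call something like `append_to_csv(data_path, take_snapshots(["robot_server"]))` once per refresh interval, where both the path and the pattern are placeholders.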
Test Plan
- [/] Unit testing
- [x] Manual testing
- [x] Push performance_metrics to robot
- [x] Run `python -m performance_metrics.system_resource_tracker` for about 30 seconds.
- [x] Check `/data/performance_metrics_data/system_resource_data_headers` and `/data/performance_metrics_data/system_resource_data` to make sure they both have data that makes sense
Changelog
- Add _SystemResourceTracker, which provides an API for retrieving system metrics. The class is intentionally private because main.py should be the only caller of _SystemResourceTracker when the tracker runs as a systemd service
- Define a journal logging config so logs can be retrieved the same way as other projects
- Add main.py script which runs _SystemResourceTracker as a polling service
- Add ProcessResourceUsageSnapshot dataclass to define storage shape for captured system metrics
- Add tests
Review requests
None
Risk assessment
Low. To run any of this, you have to manually push performance_metrics and trigger the service script.
Would it make sense to implement this as a systemd service running on the robot?
Yes, I think it makes sense to run this periodically as a systemd service. What frequency were you thinking?
I was going to try every second and see what the performance impact was. But with your suggestion of capturing the CPU seconds, it likely doesn't need to be that frequent.
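To illustrate why cumulative CPU seconds allow a slower polling rate: average utilization over any interval can be recovered from just two samples, however far apart they are. The sketch below assumes each sample records the wall-clock time and the process's cumulative user plus system CPU time; `CpuSample` and `average_cpu_utilization` are hypothetical names, not part of this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    wall_time: float    # seconds since the epoch when the sample was taken
    cpu_seconds: float  # cumulative user + system CPU time of the process


def average_cpu_utilization(earlier: CpuSample, later: CpuSample) -> float:
    """Return the average number of CPU cores kept busy between two samples.

    A result of 0.25 means the process used a quarter of one core on average.
    """
    elapsed = later.wall_time - earlier.wall_time
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (later.cpu_seconds - earlier.cpu_seconds) / elapsed
```

For example, a process that goes from 4 to 19 cumulative CPU seconds over a 60-second gap averaged 0.25 cores, and no intermediate polls are needed to compute that.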
If that happened on its own, something's definitely going wrong and we need to look into it.
If that happened when you hit Ctrl+C to stop the program, it's harmless but we should probably do something to make it less confusing. I guess the normal way to handle that is to catch KeyboardInterrupt at the top level.
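A minimal sketch of catching KeyboardInterrupt at the top level is shown here. It is not the fix that was actually committed: `run_polling_loop`, its `poll` callback, and the injectable `sleep` parameter are assumptions for illustration, and only the two log messages are taken from the output in this thread.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def run_polling_loop(
    poll: Callable[[], None],
    refresh_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call poll() forever, exiting quietly when the user presses Ctrl+C."""
    logger.info("Starting system resource tracker...")
    try:
        while True:
            poll()
            sleep(refresh_interval)
    except KeyboardInterrupt:
        # Ctrl+C is the expected way to stop a manual run, so swallow the
        # exception instead of letting the traceback reach the terminal.
        logger.info("Received Ctrl+C, shutting down.")
    finally:
        logger.info("System resource tracker is stopping.")
```

Any other exception still propagates with its traceback, so a genuine failure remains visible in the logs.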
When I manually tested, I got:
2024-07-03 18:57:16,258 - __main__ - INFO - Starting system resource tracker...
^C2024-07-03 18:57:33,090 - __main__ - INFO - System resource tracker is stopping.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/opentrons-robot-server/performance_metrics/system_resource_tracker/__main__.py", line 25, in <module>
    time.sleep(tracker.refresh_interval)
KeyboardInterrupt

Is the traceback expected?
@skowalski08, fixed with https://github.com/Opentrons/opentrons/pull/15542/commits/89c687eaa957c1758686d6e45930569893000d1f