feat: capture system resources

Open DerekMaggio opened this issue 1 year ago • 2 comments

Overview

Closes https://opentrons.atlassian.net/browse/EXEC-370

Create a service to capture system resource usage.

The service uses the psutil library to query running processes, filters them down to the ones we care about, polls their CPU and memory usage, and stores the results as CSV on the robot.
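
For readers unfamiliar with psutil, the query-and-filter step could look roughly like the sketch below. This is not the code in this PR: the function names, the substring-based filter, and the output keys are assumptions for illustration. Only `psutil.process_iter(attrs=...)` and the `cpu_times` / `memory_info` attributes are real psutil API.

```python
from typing import Any, Dict, Iterable, List, Tuple


def matches_any(cmdline: List[str], patterns: Tuple[str, ...]) -> bool:
    """Return True if the joined command line contains any of the patterns."""
    joined = " ".join(cmdline)
    return any(pattern in joined for pattern in patterns)


def collect_usage(
    processes: Iterable[Any], patterns: Tuple[str, ...]
) -> List[Dict[str, Any]]:
    """Filter process objects and extract CPU seconds and resident memory.

    Each item is expected to look like what psutil.process_iter(attrs=[...])
    yields: an object with an `info` dict keyed by attribute name.
    """
    rows = []
    for proc in processes:
        info = proc.info
        # psutil fills in None when an attribute could not be read.
        cmdline = info.get("cmdline") or []
        if not matches_any(cmdline, patterns):
            continue
        cpu = info.get("cpu_times")
        mem = info.get("memory_info")
        if cpu is None or mem is None:
            continue
        rows.append(
            {
                "pid": info["pid"],
                "name": info["name"],
                "user_cpu_seconds": cpu.user,
                "system_cpu_seconds": cpu.system,
                "memory_rss_bytes": mem.rss,
            }
        )
    return rows


def poll_once(patterns: Tuple[str, ...]) -> List[Dict[str, Any]]:
    """Take one sample of all matching processes on this machine."""
    # Third-party dependency, imported here so the helpers above
    # can be exercised without psutil installed.
    import psutil

    attrs = ["pid", "name", "cmdline", "cpu_times", "memory_info"]
    return collect_usage(psutil.process_iter(attrs=attrs), patterns)
```

Keeping the filtering separate from the psutil call means the filter can be unit tested with fake process objects, which is one way to cover this logic without a robot.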

In future PRs the service will be implemented as a systemd service on the robot. Configuration will be set in the API project by setting environment variables.
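
As a rough idea of what environment-variable configuration could look like on the Python side, here is a minimal sketch. The variable names, the `TrackerConfig` class, and the 10 second default are all hypothetical, since the follow-up PRs will define the real ones. The default storage directory is taken from the test plan below.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Hypothetical variable names; this PR does not define them.
ENV_REFRESH_INTERVAL = "OT_SYSTEM_RESOURCE_TRACKER_REFRESH_INTERVAL"
ENV_PROCESS_FILTERS = "OT_SYSTEM_RESOURCE_TRACKER_PROCESS_FILTERS"
ENV_STORAGE_DIR = "OT_SYSTEM_RESOURCE_TRACKER_STORAGE_DIR"


@dataclass(frozen=True)
class TrackerConfig:
    """Hypothetical settings container for the tracker."""

    refresh_interval: float
    process_filters: Tuple[str, ...]
    storage_dir: str


def load_config(env: Optional[Mapping[str, str]] = None) -> TrackerConfig:
    """Build a config from environment variables, falling back to defaults."""
    source = os.environ if env is None else env

    # Assumed default of 10 seconds between polls.
    interval = float(source.get(ENV_REFRESH_INTERVAL, "10.0"))
    if interval <= 0:
        raise ValueError("refresh interval must be greater than zero")

    # Comma-separated list of substrings to match against command lines.
    raw_filters = source.get(ENV_PROCESS_FILTERS, "")
    filters = tuple(part.strip() for part in raw_filters.split(",") if part.strip())

    storage_dir = source.get(ENV_STORAGE_DIR, "/data/performance_metrics_data")
    return TrackerConfig(interval, filters, storage_dir)
```

With something like this, a systemd unit could supply the values through its `Environment=` directive and the script would pick them up at startup.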

Test Plan

  • [/] Unit testing
  • [x] Manual testing
    • [x] Push performance_metrics to robot
    • [x] Run python -m performance_metrics.system_resource_tracker for about 30 seconds.
    • [x] Check /data/performance_metrics_data/system_resource_data_headers and /data/performance_metrics_data/system_resource_data to make sure they both contain data that makes sense.

Changelog

  • Add _SystemResourceTracker, which provides an API for retrieving system metrics. The class is intentionally private because main.py should be the only thing that uses _SystemResourceTracker when running the tracker as a systemd service
  • Define a journal logging config so logs can be retrieved the same way as in other projects
  • Add main.py script which runs _SystemResourceTracker as a polling service
  • Add ProcessResourceUsageSnapshot dataclass to define storage shape for captured system metrics
  • Add tests
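
To illustrate the storage shape, a snapshot dataclass that writes a header row and data rows as CSV might look like the sketch below. The field names and helper functions are assumptions for illustration and may not match the real ProcessResourceUsageSnapshot. The split into a headers file and a data file mirrors the two paths in the test plan.

```python
import csv
import dataclasses
from typing import Iterable, TextIO, Tuple


@dataclasses.dataclass(frozen=True)
class ProcessResourceUsageSnapshot:
    """One sample of one process. Field names here are illustrative."""

    query_time: int  # when the sample was taken, e.g. nanoseconds since epoch
    command: str  # command line of the sampled process
    running_since: float  # process start time, seconds since epoch
    user_cpu_time: float  # cumulative CPU seconds in user mode
    system_cpu_time: float  # cumulative CPU seconds in kernel mode
    memory_percent: float  # share of system memory in use by the process

    @classmethod
    def headers(cls) -> Tuple[str, ...]:
        """Column names, derived from the dataclass fields in order."""
        return tuple(field.name for field in dataclasses.fields(cls))

    def to_row(self) -> tuple:
        """Field values in the same order as headers()."""
        return dataclasses.astuple(self)


def write_headers(stream: TextIO) -> None:
    """Write the single header row to the headers file."""
    csv.writer(stream).writerow(ProcessResourceUsageSnapshot.headers())


def append_snapshots(
    stream: TextIO, snapshots: Iterable[ProcessResourceUsageSnapshot]
) -> None:
    """Append one CSV row per snapshot to the data file."""
    csv.writer(stream).writerows(snapshot.to_row() for snapshot in snapshots)
```

Deriving the headers from the dataclass fields keeps the headers file and the data file in sync if a field is added later.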

Review requests

None

Risk assessment

Low. To run any of this, you have to manually push performance_metrics and trigger the service script.

DerekMaggio avatar Jun 27 '24 16:06 DerekMaggio

Would it make sense to implement this as a systemd service running on the robot?

DerekMaggio avatar Jun 28 '24 13:06 DerekMaggio

Yes, I think it makes sense to run this periodically as a systemd service. What frequency were you thinking?

I was going to try every second and see what the performance impact was. But with your suggestion of capturing the CPU seconds, it likely doesn't need to be that frequent.

DerekMaggio avatar Jul 01 '24 17:07 DerekMaggio

If that happened on its own, something's definitely going wrong and we need to look into it.

If that happened when you hit Ctrl+C to stop the program, it's harmless but we should probably do something to make it less confusing. I guess the normal way to handle that is to catch KeyboardInterrupt at the top level.
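
A minimal sketch of what catching KeyboardInterrupt at the top level could look like is below. This is not necessarily what the PR ended up doing: `run_polling_loop` and its parameters are hypothetical names, and the injectable `sleep` argument is only there so the loop can be tested without waiting.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def run_polling_loop(
    poll: Callable[[], None],
    refresh_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call poll() forever, pausing between calls, until Ctrl+C is pressed."""
    logger.info("Starting system resource tracker...")
    try:
        while True:
            poll()
            sleep(refresh_interval)
    except KeyboardInterrupt:
        # Ctrl+C is the expected way to stop a foreground run, so log the
        # shutdown and return instead of letting the traceback print.
        logger.info("System resource tracker is stopping.")
```

Any other exception still propagates with a full traceback, so a crash that happens on its own stays visible.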

SyntaxColoring avatar Jul 05 '24 15:07 SyntaxColoring

When I manually tested, I got:

    2024-07-03 18:57:16,258 - __main__ - INFO - Starting system resource tracker...
    ^C2024-07-03 18:57:33,090 - __main__ - INFO - System resource tracker is stopping.
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/opt/opentrons-robot-server/performance_metrics/system_resource_tracker/__main__.py", line 25, in <module>
        time.sleep(tracker.refresh_interval)
    KeyboardInterrupt

Is the traceback expected?

@skowalski08, fixed with https://github.com/Opentrons/opentrons/pull/15542/commits/89c687eaa957c1758686d6e45930569893000d1f

DerekMaggio avatar Jul 08 '24 13:07 DerekMaggio