[SAMZA-2800] A new Control Group Metric for Samza

Open li-afaris opened this issue 8 months ago • 0 comments

Introduction

Hadoop clusters can restrict CPU usage for Samza applications by using Control Groups (Cgroups). Before CPU enforcement is enabled on Hadoop clusters, application owners need a way to know when their application is being throttled by Cgroups. This PR adds a new Cgroup metric that tells application owners whether a container's CPU usage is being throttled by control groups & whether the application needs to request additional resources.

Implementation

The Linux kernel reports when applications within a Cgroup have been throttled by writing values to a file named cpu.stat. The two fields of interest in cpu.stat are nr_periods & nr_throttled. nr_periods represents the number of enforcement periods that have elapsed. nr_throttled represents the number of times the group has been throttled. We can treat these fields as a ratio, nr_throttled / nr_periods, that shows how often the applications were throttled over a number of enforcement periods. The proposal is to have the running container locate the cpu.stat file by reading property values from Hadoop's YARN config.
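The ratio described above can be sketched in a few lines of Java. This is a minimal illustration, not the code from the PR: the class name `CpuStatSketch` and its methods are hypothetical, and it assumes cpu.stat contains one whitespace-separated `key value` pair per line. Locating the file from the YARN config is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Samza): parses cpu.stat text and computes the throttle ratio. */
final class CpuStatSketch {
    private CpuStatSketch() {}

    /** Parses "key value" lines such as "nr_periods 100" into a map; malformed lines are skipped. */
    static Map<String, Long> parse(String cpuStatContents) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : cpuStatContents.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // not a "key value" pair
            }
            try {
                fields.put(parts[0], Long.parseLong(parts[1]));
            } catch (NumberFormatException e) {
                // value is not a number; ignore this line
            }
        }
        return fields;
    }

    /** Returns nr_throttled / nr_periods, or 0.0 when no enforcement period has elapsed. */
    static double throttleRatio(Map<String, Long> fields) {
        long periods = fields.getOrDefault("nr_periods", 0L);
        long throttled = fields.getOrDefault("nr_throttled", 0L);
        if (periods <= 0) {
            return 0.0; // avoid division by zero
        }
        return (double) throttled / periods;
    }
}
```

For example, a file reporting `nr_periods 200` and `nr_throttled 50` would yield a ratio of 0.25 with this sketch.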

Implementation details

To limit cardinality in the metrics storage layer, the metric will emit the Samza container ID as the hostname (e.g., Container 3) instead of the Hadoop YARN container ID. This is already supported by the existing metrics framework within Samza.
The container will emit a float value between 0 and 1 as a gauge metric. A value of 0 means the Cgroup was not throttled for that period of time. A value of 1 means the Cgroup was persistently throttled, i.e. in every enforcement period.
To stay consistent with existing metrics, a negative value (-1) will be emitted if an exception is thrown when reading the cpu.stat file. Exceptions when reading cpu.stat will be logged to the container logs.
This implementation will be specific to Samza on Hadoop. The reasoning is that the application itself should emit this metric, not the embedded library.
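The gauge behaviour described above (a value between 0 and 1, or -1 on a read failure) could look roughly like the following. This is a sketch under assumptions: the class `ThrottleGaugeSketch` is hypothetical and does not use Samza's metrics API. It also assumes the value is computed from the change in counters between two consecutive reads, which the proposal does not spell out. Errors go to standard error here as a stand-in for the container logs.

```java
import java.util.concurrent.Callable;

/** Hypothetical sampler (not Samza API): turns cpu.stat counters into a gauge value. */
final class ThrottleGaugeSketch {
    /** Value emitted when cpu.stat cannot be read, matching the -1 convention described above. */
    static final double ERROR_VALUE = -1.0;

    /** Assumed reader that returns {nr_periods, nr_throttled} from cpu.stat. */
    private final Callable<long[]> reader;
    private long lastPeriods = -1;
    private long lastThrottled = -1;

    ThrottleGaugeSketch(Callable<long[]> reader) {
        this.reader = reader;
    }

    /** Returns the throttle ratio since the previous successful sample, or -1 on failure. */
    double sample() {
        try {
            long[] counters = reader.call();
            long periods = counters[0];
            long throttled = counters[1];
            // The first sample has no baseline, so it uses the cumulative counters.
            long deltaPeriods = lastPeriods < 0 ? periods : periods - lastPeriods;
            long deltaThrottled = lastThrottled < 0 ? throttled : throttled - lastThrottled;
            lastPeriods = periods;
            lastThrottled = throttled;
            if (deltaPeriods <= 0) {
                return 0.0; // no enforcement period elapsed since the last sample
            }
            double ratio = (double) deltaThrottled / deltaPeriods;
            return Math.max(0.0, Math.min(1.0, ratio)); // clamp to [0, 1]
        } catch (Exception e) {
            // Stand-in for logging to the container logs.
            System.err.println("Failed to read cpu.stat: " + e);
            return ERROR_VALUE;
        }
    }
}
```

In a real container, `sample()` would be called on each metrics reporting interval and its result set on the gauge. A failed read leaves the previous baseline untouched, so the next successful sample covers the whole gap.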

Considered Alternatives

I’m unaware of alternatives, but reading values from cpu.stat is a pattern that appears in the runc project. runc is the low-level runtime underneath containerd, which is used by both Docker & Kubernetes.

The metric needs to be emitted from the Samza container itself. Using a system daemon or sidecar application complicates deployments & creates data consistency issues when the sidecar process isn’t running.

External references

Linux kernel documentation on the cpu.stat file
cpu.stat references from the Open Container Initiative runc project.

Jun 03 '24 19:06 li-afaris
