pulsar [feature][broker] Support cgroup v2 by using `jdk.internal.platform.Metrics` in Pulsar Loadbalancer

Master Issue: #16601

Motivation

The Pulsar load balancer currently detects CPU limits using the cgroup v1 API. Since jdk.internal.platform.Metrics already supports both cgroup v1 and v2, we should use it to get the cgroup metrics.

Reference: https://code.yawk.at/java/17/java.base/jdk/internal/platform/

Modifications

Use jdk.internal.platform.Metrics to get the cgroup metrics in the LinuxInfoUtils.
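
A minimal sketch of what such reflective access could look like. The class name `CgroupMetricsProbe` and the helper `readLong` are hypothetical and are not taken from this PR; the getter names (`getCpuQuota`, `getCpuPeriod`, `getCpuUsage`) belong to the `jdk.internal.platform.Metrics` interface. Because the package is not exported to application code, the calls are expected to fail unless the JVM is started with a flag such as `--add-opens java.base/jdk.internal.platform=ALL-UNNAMED`, so every failure is mapped to an empty result here.

```java
import java.lang.reflect.Method;
import java.util.OptionalLong;

// Hypothetical helper, not the PR's LinuxInfoUtils: reads one long-valued
// getter from jdk.internal.platform.Metrics via reflection.
final class CgroupMetricsProbe {
    private CgroupMetricsProbe() {
    }

    static OptionalLong readLong(String getterName) {
        try {
            Class<?> metricsClass = Class.forName("jdk.internal.platform.Metrics");
            // systemMetrics() returns null when no container metrics are available.
            Object metrics = metricsClass.getMethod("systemMetrics").invoke(null);
            if (metrics == null) {
                return OptionalLong.empty();
            }
            Method getter = metricsClass.getMethod(getterName);
            Object value = getter.invoke(metrics);
            return value instanceof Long ? OptionalLong.of((Long) value) : OptionalLong.empty();
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Class missing, package not opened, unknown getter, or the call itself failed.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("cpu quota  = " + readLong("getCpuQuota"));
        System.out.println("cpu period = " + readLong("getCpuPeriod"));
        System.out.println("cpu usage  = " + readLong("getCpuUsage"));
    }
}
```

Returning an empty value instead of throwing lets a caller fall back to another source (for example the number of available processors) when the internal API is unreachable.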

Verifying this change

[x] Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as testCGroupMetrics.


Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API: (yes / no)
The schema: (yes / no / don't know)
The default values of configurations: (yes / no)
The wire protocol: (yes / no)
The rest endpoints: (yes / no)
The admin cli options: (yes / no)
Anything that affects deployment: (yes / no / don't know)

Documentation

Check the box below or label this PR directly.

Need to update docs?

[ ] doc-required (Your PR needs to update docs and you will update later)
[x] doc-not-needed (Please explain why)
[ ] doc (Your PR contains doc changes)
[ ] doc-complete (Docs have been already added)

Jul 28 '22 03:07 coderzc

/pulsarbot run-failure-checks

Jul 29 '22 03:07 coderzc

/pulsarbot run-failure-checks

Aug 04 '22 07:08 coderzc

/pulsarbot run-failure-checks

Aug 11 '22 02:08 coderzc

I don't understand something. You're using an internal class from the JDK. It's internal, so you're not supposed to use it as is, no? How is the JDK itself using this, and maybe we can get this info in a more public API way? Can you describe in more detail the motivation to get those metrics and why normal MBean metrics don't cut it?

Aug 15 '22 06:08 asafm

> I don't understand something. You're using an internal class from the JDK. It's internal, so you're not supposed to use it as is, no? How is the JDK itself using this, and maybe we can get this info in a more public API way? Can you describe in more detail the motivation to get those metrics and why normal MBean metrics don't cut it?

@asafm OperatingSystemMXBean uses jdk.internal.platform.Metrics to get the CPU load. But OperatingSystemMXBean only exports the CPU load; we also need the CPU limit, and unfortunately the JDK does not export it. Also, for the CPU usage we have our own calculation formula, which is not exactly the same as OperatingSystemMXBean#getCpuLoad.
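
For reference, a minimal sketch of what the public API exposes. `PublicCpuLoad` is a hypothetical class name; `getCpuLoad()` is available on `com.sun.management.OperatingSystemMXBean` from JDK 14 on and returns a fraction between 0.0 and 1.0, or a negative value when the load is unavailable. There is no public getter for the cgroup quota or period on this bean, which is the gap described above.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper showing the public, supported way to read CPU load.
final class PublicCpuLoad {
    private PublicCpuLoad() {
    }

    /** Returns the CPU load as a fraction in [0.0, 1.0], or -1 when unavailable. */
    static double cpuLoad() {
        java.lang.management.OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            // Requires JDK 14+; older JDKs only have the deprecated getSystemCpuLoad().
            return ((com.sun.management.OperatingSystemMXBean) os).getCpuLoad();
        }
        return -1;
    }

    /** Number of processors visible to the JVM (container-aware on recent JDKs). */
    static int availableProcessors() {
        return ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("cpu load             = " + cpuLoad());
        System.out.println("available processors = " + availableProcessors());
    }
}
```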

Aug 15 '22 07:08 coderzc

> And, for the CPU usage we have our own calculation formula, which is not exactly the same as OperatingSystemMXBean#getCpuLoad.

Can you please explain why it is different? I do see that eventually the load balancer needs the percentage, right? That is, how much CPU was used relative to the limit, which is 100%? The Sun implementation of getCpuLoad does try to calculate that, no?

Aug 15 '22 12:08 asafm

> Can you please explain why it is different? I do see that eventually the load balancer needs the percentage, right? That is, how much CPU was used relative to the limit, which is 100%? The Sun implementation of getCpuLoad does try to calculate that, no?

We need to record both usage and limit, but OperatingSystemMXBean#getCpuLoad only returns the percentage. We sometimes need to use them separately, for example: https://github.com/apache/pulsar/blob/6704f12104219611164aa2bb5bbdfc929613f1bf/pulsar-broker/src/main/java/org/apache/pulsar/broker/tools/LoadReportCommand.java#L110-L127

Also, the usage we calculate is a mean value; for more details please see: https://github.com/apache/pulsar/blob/380c7587d0cefdd763030f348246fef711bfd58c/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java#L139-L144

Aug 15 '22 15:08 coderzc

Ok, I did some reading and this is what I found:

The usage calculation in OperatingSystemMXBean, which delegates to OperatingSystemImpl, gives you the percentage of CPU usage relative to the limit imposed on the container group, as can be seen here:

            long quota = containerMetrics.getCpuQuota();
            long share = containerMetrics.getCpuShares();
            if (quota > 0) {
                long numPeriods = containerMetrics.getCpuNumPeriods();
                long quotaNanos = TimeUnit.MICROSECONDS.toNanos(quota * numPeriods);
                return getUsageDividesTotal(cpuUsageSupplier().getAsLong(), quotaNanos);

which calls

        private double getUsageDividesTotal(long usageTicks, long totalTicks) {
            // If cpu quota or cpu shares are in effect. Calculate the cpu load
            // based on the following formula (similar to how
            // getCpuLoad0() is being calculated):
            //
            //   | usageTicks - usageTicks' |
            //  ------------------------------
            //   | totalTicks - totalTicks' |
            //
            // where usageTicks' and totalTicks' are historical values
            // retrieved via an earlier call of this method.
            if (usageTicks < 0 || totalTicks <= 0) {
                return -1;
            }
            long distance = usageTicks - this.usageTicks;
            this.usageTicks = usageTicks;
            long totalDistance = totalTicks - this.totalTicks;
            this.totalTicks = totalTicks;
            double systemLoad = 0.0;
            if (distance > 0 && totalDistance > 0) {
                systemLoad = ((double)distance) / totalDistance;
            }
            // Ensure the return value is in the range 0.0 -> 1.0
            systemLoad = Math.max(0.0, systemLoad);
            systemLoad = Math.min(1.0, systemLoad);
            return systemLoad;
        }

So we can obtain the percentage of CPU used relative to the CPU allocated to the container (container group). As you can see, the code pasted above also does a delta calculation relative to the last time it was called. It is the same as what we do, as can be seen in the Pulsar code here:

    private double getTotalCpuUsageForCGroup(double elapsedTimeSeconds) {
        double usage = getCpuUsageForCGroup();
        double currentUsage = usage - lastCpuUsage;
        lastCpuUsage = usage;
        return 100 * currentUsage / elapsedTimeSeconds / TimeUnit.SECONDS.toNanos(1);
    }

The main point here is that JDK does provide relative usage or as you said "mean" value.
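
To make the delta calculation concrete, here is a minimal sketch with the cumulative counter passed in as a parameter instead of being read from the cgroup. `CpuUsageSampler` is a hypothetical class name; the formula mirrors the `getTotalCpuUsageForCGroup` snippet above, and the input values are made up for illustration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, self-contained version of the delta-based usage calculation.
final class CpuUsageSampler {
    private double lastCpuUsageNanos;

    /**
     * @param cumulativeCpuNanos total CPU time consumed by the cgroup so far, in nanoseconds
     * @param elapsedSeconds     wall-clock time since the previous sample
     * @return CPU used since the previous sample, where 100 means one fully busy core
     */
    double sample(long cumulativeCpuNanos, double elapsedSeconds) {
        double delta = cumulativeCpuNanos - lastCpuUsageNanos;
        lastCpuUsageNanos = cumulativeCpuNanos;
        return 100 * delta / elapsedSeconds / TimeUnit.SECONDS.toNanos(1);
    }

    public static void main(String[] args) {
        CpuUsageSampler sampler = new CpuUsageSampler();
        // 1.5 CPU-seconds consumed during the first second.
        System.out.println(sampler.sample(1_500_000_000L, 1.0));
        // Another 0.5 CPU-seconds consumed during the next second.
        System.out.println(sampler.sample(2_000_000_000L, 1.0));
    }
}
```

Only the difference between consecutive samples matters, so the absolute value of the counter at startup does not affect later results.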

The biggest difference I see is that when we report usage percent, we do so relative to the entire host: we take the CPU usage for the cgroup as reported by the operating system (measured in nanoseconds), keep only the delta from the last measurement, and divide that by the elapsed time, so in effect it is CPU used relative to the entire host.

    public static long getCpuUsageForCGroup() {
        try {
            if (metrics != null && getCpuUsageMethod != null) {
                return (long) getCpuUsageMethod.invoke(metrics);
            }

Which delegates to

    /**
     * Returns the aggregate time, in nanoseconds, consumed by all
     * tasks in the Isolation Group.
     *
     * @return Time in nanoseconds, -1 if unknown or
     *         -2 if the metric is not supported.
     *
     */
    public long getCpuUsage();

The limit we report is the percentage of CPU allocated, relative to the entire host.

It actually looked a bit odd to me: Why do we report usage relative to host and limit relative to host? Why not just report usage in percent relative to the allocated container limit?

So I looked at how we are using it, as you also mentioned above, and found these references:

            if (entry.getValue().limit > 0 && entry.getValue().limit > entry.getValue().usage) {
                Double temp = ((entry.getValue().limit - entry.getValue().usage) / entry.getValue().limit) * 100;
                percentAvailable = temp.intValue();
            }

In the above code you can see that it doesn't really need the percentage to be relative to the host, nor does it need the limit. It just needs to know the CPU usage in percent relative to the container's allocated CPU limit (a value below 100%), which the JDK gives.

           double cpuChange = (newUsage.cpu.limit > 0)
                            ? ((newUsage.cpu.usage - oldUsage.cpu.usage) * 100 / newUsage.cpu.limit)
                            : 0;

In the above code, you can see it only needs to know how much percent has changed - same thing - it can use the JDK load percent.
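
A small sketch of the arithmetic behind this argument. `LimitRelativeMath` and its method names are hypothetical; the first and third methods restate the two snippets above (assuming `percentAvailable` defaults to 0), and the other two compute the same quantities from a load fraction in [0.0, 1.0] such as the one the JDK reports.

```java
// Hypothetical helper: both load-balancer formulas depend only on usage / limit.
final class LimitRelativeMath {
    private LimitRelativeMath() {
    }

    /** Mirrors the percentAvailable snippet: needs usage and limit. */
    static int percentAvailable(double usage, double limit) {
        if (limit > 0 && limit > usage) {
            return (int) (((limit - usage) / limit) * 100);
        }
        return 0;
    }

    /** Same quantity computed from a load fraction (usage / limit). */
    static int percentAvailableFromLoad(double load) {
        return (load >= 0 && load < 1) ? (int) ((1 - load) * 100) : 0;
    }

    /** Mirrors the cpuChange snippet: needs both usages and the limit. */
    static double cpuChange(double oldUsage, double newUsage, double limit) {
        return limit > 0 ? (newUsage - oldUsage) * 100 / limit : 0;
    }

    /** Same quantity computed from two load fractions. */
    static double cpuChangeFromLoad(double oldLoad, double newLoad) {
        return (newLoad - oldLoad) * 100;
    }

    public static void main(String[] args) {
        // Example: limit 200 (two cores), usage 150, i.e. a load fraction of 0.75.
        System.out.println(percentAvailable(150, 200));
        System.out.println(percentAvailableFromLoad(0.75));
        System.out.println(cpuChange(100, 150, 200));
        System.out.println(cpuChangeFromLoad(0.5, 0.75));
    }
}
```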

The biggest issue I have is only with the reporting

    private void printResourceUsage(String name, ResourceUsage usage) {
        spec.console().println(name + " : usage = " + usage.usage + ", limit = " + usage.limit);
    }

I don't really understand why someone would need to know the limit. Why isn't the CPU percentage relative to the container allocation enough?

I understand that all I mentioned here is a bit of technical debt to fix in this PR. My point is that if we didn't mind not reporting the limit, we wouldn't need to be operating-system specific at all: we wouldn't need to write Linux-based reporting classes that read the info from the file system, and now through JVM internal classes. We could just use OperatingSystemMXBean.

@heesung-sn What do you think?

Aug 16 '22 06:08 asafm

Please note that I haven't found any other open source that uses jdk.internal.platform.Metrics - I tried searching using TabNine code search and also GitHub search.

Aug 16 '22 07:08 asafm

> @heesung-sn What do you think?

According to this article, https://developers.redhat.com/articles/2022/04/19/java-17-whats-new-openjdks-container-awareness#recent_changes_in_openjdk_s_container_awareness_code

It seems like OperatingSystemMXBean already provides CPU usage as a percentage relative to the cgroup, for both v1 and v2. (I will call it cpu_usage_in_percent_cgroup for our discussion.)

I agree with you. From my understanding, load balancer cares about cpu_usage_in_percent_cgroup in the end for the cpu usage computation.

However, the LB also requires other signals, such as direct_memory_usage_in_percent_cgroup and network_usage_in_percent_cgroup, which require separate limits for their percentage computation.

As you pointed out, the problem is that LB uses the same generic code (requiring limit) to compute the resource percentage.

Maybe we can tweak the code to ignore the limit and use the signal as-is if they are already in the *_usage_in_percent_cgroup form.

Aug 16 '22 16:08 heesung-sohn

> However, the LB also requires other signals, such as direct_memory_usage_in_percent_cgroup and network_usage_in_percent_cgroup, which require separate limits for their percentage computation.

Can you please expand on that? I don't see any method providing usage and limit of direct memory in Pulsar. I do see usage and limit for physical memory (in the container), which, as we said, we can take from the MXBean, which provides total and used memory.

Regarding network usage: yes, I was surprised to see that the operating system MXBean doesn't include it, and that we resort to reading operating-system-specific files to obtain it. I posted a question on the OpenJDK general discussion mailing list to get the motivation for that (I did search their JIRA issue database and also googled the mailing list site but found nothing).

> Maybe we can tweak the code to ignore the limit and use the signal as-is if they are already in the *_usage_in_percent_cgroup form.

I guess that requires a separate PR. Maybe we can push that PR in but add an issue to refactor (remove) the dependency on limit and only use the percentage (which should be enough), or refactor to avoid using limit and be more specific, like ResourceUsage.CpuUsage, MemoryUsage, NetworkUsage, each providing its own method, where the CPU would only provide a percentage.
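
One way the suggested refactoring could look, as a rough sketch only. None of these types exist in Pulsar in this form: `ResourceUsageSketch` and its nested classes, fields, and units are hypothetical and only illustrate the idea that CPU exposes a percentage while memory and network keep their own usage and limit.

```java
// Hypothetical shape for per-resource usage types; not Pulsar's ResourceUsage.
final class ResourceUsageSketch {
    private ResourceUsageSketch() {
    }

    /** CPU only exposes a percentage relative to the container allocation. */
    static final class CpuUsage {
        private final double percent;

        CpuUsage(double percent) {
            this.percent = percent;
        }

        double percent() {
            return percent;
        }
    }

    /** Memory keeps usage and limit because both are readily available. */
    static final class MemoryUsage {
        private final long usedBytes;
        private final long limitBytes;

        MemoryUsage(long usedBytes, long limitBytes) {
            this.usedBytes = usedBytes;
            this.limitBytes = limitBytes;
        }

        double percent() {
            return limitBytes > 0 ? 100.0 * usedBytes / limitBytes : 0;
        }
    }

    /** Network keeps usage and limit; the unit here is an assumption. */
    static final class NetworkUsage {
        private final double usedKbps;
        private final double limitKbps;

        NetworkUsage(double usedKbps, double limitKbps) {
            this.usedKbps = usedKbps;
            this.limitKbps = limitKbps;
        }

        double percent() {
            return limitKbps > 0 ? 100.0 * usedKbps / limitKbps : 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(new CpuUsage(75).percent());
        System.out.println(new MemoryUsage(512, 1024).percent());
        System.out.println(new NetworkUsage(250, 1000).percent());
    }
}
```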

Aug 17 '22 08:08 asafm

> Can you please expand on that? I don't see any method providing usage and limit of direct memory in Pulsar. I do see usage and limit for physical memory (in the container), which, as we said, we can take from the MXBean, which provides total and used memory.

I see this method is used to collect direct memory usage.

https://github.com/apache/pulsar/blob/535415302ef6d1a9017f6ec25b87b24afd081155/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LoadManagerShared.java#L228-L229

> I guess that requires a separate PR. Maybe we can push that PR in but add an issue to refactor (remove) the dependency on limit and only use the percentage (which should be enough), or refactor to avoid using limit and be more specific, like ResourceUsage.CpuUsage, MemoryUsage, NetworkUsage, each providing its own method, where the CPU would only provide a percentage.

Yes, a separate PR makes sense to me.

Aug 17 '22 17:08 heesung-sohn

The pr had no activity for 30 days, mark with Stale label.

Sep 17 '22 02:09 github-actions[bot]

Any updates? I think this PR makes sense to use the cgroup v2.

Apr 26 '23 16:04 nodece

I will update this PR soon.

Apr 27 '23 04:04 coderzc

> I guess that requires a separate PR. Maybe we can push that PR in but add an issue to refactor (remove) the dependency on limit and only use the percentage (which should be enough), or refactor to avoid using limit and be more specific, like ResourceUsage.CpuUsage, MemoryUsage, NetworkUsage, each providing its own method, where the CPU would only provide a percentage.

Yes, we need a separate PR to refactor it.

Apr 28 '23 13:04 coderzc

We need this to be included in branch-3.0 and branch-2.11 asap. I backported the changes to branch-2.10 with all dependencies in #20659.

This is becoming urgent.

AKS Kubernetes 1.25+ switches to use cgroup v2: https://github.com/Azure/AKS/releases/tag/2023-03-05

AKS Kubernetes 1.24 goes End-of-life on July 30, 2023.

GKE continues to have a way to select between cgroup v1 and cgroup v2: https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config#cgroup-mode-options

GKE will default to cgroup v2 in new Kubernetes 1.26 clusters or node pools.

AWS EKS v1.26 nodes will default to cgroup v2.

Jun 27 '23 10:06 lhotari