help request: My APISIX is experiencing 100% CPU utilization and has become unresponsive

Open zhaoqiang1980 opened this issue 6 months ago • 6 comments

Description

When the free_size of the Prometheus shared dict (share_dict) in APISIX drops to 0, can that trigger 100% CPU usage? This happened on 2 of our 38 instances, and even after all traffic was removed, the CPU remained fully occupied. Profiling with perf and a flame graph shows the hottest functions are ngx_shmtx_lock and ngx_shmtx_unlock.

Screenshots attached: strace output and perf top output.

We have registered a very large number of routes, so the metrics payload is very large. The memory currently allocated to the dict is 100M. What should I do? Should I scale up the Prometheus shared dict?

Environment

APISIX: 2.15
OS: Linux 4 3.1-1160.92.1.el7.x86_64 #1 SMP Tue Jun 20 11:48:01 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
OpenResty: 1.21.4.2

zhaoqiang1980 avatar May 31 '25 13:05 zhaoqiang1980

To add to the above: etcd version is 3.5.0.

zhaoqiang1980 avatar May 31 '25 13:05 zhaoqiang1980

Hi, because your APISIX instance is too old (2.15), I can't provide any clues or solutions. If you can reproduce this on the latest version, I can try to help :)

juzhiyuan avatar May 31 '25 13:05 juzhiyuan

In version 3.x the dashboard has been integrated into the API, but since we have enhanced our own dashboard, upgrading to version 3 would require significant modifications, and we don't have the time or people to do this.

I also want to upgrade, but the gateway is used by the entire company, so we can't upgrade it casually. We can't push for the upgrade either, as there's no funding or time allocated to us. Please tell me, what should I do? Thank you~~~

zhaoqiang1980 avatar May 31 '25 15:05 zhaoqiang1980

In principle, we do not provide support for older versions, especially across a gap as large as 2.x to 3.x. If an issue cannot be reproduced on the master branch, it will basically not be addressed (or only with very low priority). Generally, you need to build an independent test environment from the master branch, or at least the latest release, and test there; if the issue reproduces, report it and it should get fixed.

However, I have heard a little about the abnormal CPU consumption caused by the Prometheus plugin that you mention. It seems to come from some unexpected implementation and integration details inside the underlying library, which are not easy to fix. The most economical and quick mitigation is to give it a larger shdict buffer.
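For reference, the size of that buffer is normally set in conf/config.yaml; the key path below is the one recent 2.x config-default.yaml files use, so please verify it against the file shipped with your installation. A minimal sketch, assuming the current 100M allocation is raised to 200M:

```yaml
# conf/config.yaml -- sketch only; confirm the key path against the
# config-default.yaml bundled with your APISIX 2.15 install
nginx_config:
  http:
    lua_shared_dict:
      prometheus-metrics: 200m   # raised from the current 100m
```

After changing it, regenerate the nginx configuration and restart APISIX so the larger shared memory zone actually takes effect.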

You are using an extremely outdated version. Since the version you are using, we have introduced some improvements to the plugin, such as a mechanism for evicting old data (in the old implementation, shdict keys did not have a TTL, which allowed data to accumulate indefinitely). Although abnormal CPU consumption may still occur if configured improperly (due to internal library details), the presence of shdict TTL can protect against most scenarios.
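To illustrate only what the TTL change means at the shdict level (this is not the plugin's actual code; the key name and TTL value below are made up): writing entries with an exptime lets idle series be evicted instead of accumulating until the dict is full.

```lua
-- Illustration of the TTL idea only, not the plugin's real implementation.
local shdict = ngx.shared["prometheus-metrics"]
local METRIC_TTL = 3600  -- seconds; hypothetical value

-- old behaviour: no exptime, so a key stays until the dict runs out of space
-- shdict:set(key, value)

-- with an exptime, a key that stops being refreshed eventually expires
-- and its memory can be reclaimed
local ok, err = shdict:set("some_metric_key", 1, METRIC_TTL)
if not ok then
    ngx.log(ngx.ERR, "failed to write metric key: ", err)
end
```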

If you cannot upgrade, you will not benefit from these improvements. If you need them, you will have to explore backporting them to your version on your own; that is essentially a custom build, it will not receive official support, and it requires some experience.

bzp2010 avatar Jun 02 '25 12:06 bzp2010

@zhaoqiang1980 Thanks for the detailed report!

This is actually a known issue. It tends to occur more frequently when the shared dictionary (prometheus-metrics) is configured with a relatively small size. Under high concurrency and with a large number of metrics, the shared dict becomes a hotspot and introduces lock contention.

The most straightforward mitigation is to increase the size of the shared dict to reduce contention.

I think a more robust solution would be to implement a graceful degradation mechanism in the prometheus plugin. For example, when it detects that the shared memory is full and lock contention is impacting performance, it could temporarily pause metrics collection for 5 minutes. This may result in some metrics loss, but would prevent the CPU from hitting 100% and affecting overall system stability.
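A very rough sketch of that idea (illustrative only, not existing plugin code; the wrapper, threshold, and cooldown values are assumptions), built on the shared dict's free_space() API:

```lua
-- Illustrative sketch of the proposed degradation, not existing APISIX code.
-- Metric updates are skipped for a cooldown window once the shared dict is
-- nearly full, instead of fighting over the shdict lock.
local shdict = ngx.shared["prometheus-metrics"]

local PAUSE_SECONDS  = 300             -- "pause for 5 minutes" from the proposal
local MIN_FREE_BYTES = 1024 * 1024     -- assumed threshold, tune as needed
local paused_until   = 0               -- per-worker state

local function record_metric(update_fn, ...)
    if ngx.now() < paused_until then
        return                         -- degraded mode: drop this sample
    end
    if shdict:free_space() < MIN_FREE_BYTES then
        paused_until = ngx.now() + PAUSE_SECONDS
        ngx.log(ngx.WARN, "prometheus shdict nearly full, pausing metrics for ",
                PAUSE_SECONDS, "s")
        return
    end
    update_fn(...)                     -- e.g. a counter:inc() from the metrics lib
end
```

free_space() has been available in lua-nginx-module since 0.10.11, so the OpenResty 1.21.4.2 build mentioned above should already have it.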

We’d love to hear what others think about this approach.

moonming avatar Jun 03 '25 02:06 moonming

@moonming Thanks for your response. We will increase the size of the shared dict first. If that is not enough, we will try to bring the TTL capability from the newer versions into our current version, to address the issue where metrics only grow and are never evicted. Thanks as well, @bzp2010.

zhaoqiang1980 avatar Jun 03 '25 03:06 zhaoqiang1980