
Memory leak when enabling prometheus as a global plugin

suninuni opened this issue 4 years ago • 20 comments

Issue description

When the prometheus global plugin is enabled, the memory of apisix continues to grow abnormally.

Environment

  • apisix version (cmd: apisix version): 2.7

Steps to reproduce

  • deploy the httpbin service in k8s (following the steps in apisix-ingress-controller)
  • enable prometheus as a global rule and set prefer_name to true (lua_shared_dict.prometheus-metrics is 10m)
  • mock HTTP requests through ab with high concurrency: ab -n 2000000 -c 2000 http://xxxxx

Actual result

  • memory keeps growing

    image

Error log

  • *45077733 [lua] prometheus.lua:860: log_error(): Unexpected error adding a key: no memory while logging request,

Expected result

  • The memory should not continue to grow.
  • Prometheus should release a part of the memory allocated to it after it is used up to avoid continuous errors.

suninuni avatar Sep 18 '21 03:09 suninuni

related: https://github.com/apache/apisix/issues/3917#issuecomment-921736339

@tzssangglass FYI

suninuni avatar Sep 18 '21 03:09 suninuni

image

It keeps growing, FYI.

suninuni avatar Sep 18 '21 07:09 suninuni

First of all, we need to check whether the error log is caused by Prometheus's memory usage being unbounded, or just because the default size is too small.

The configuration of the lua shared dict is not available until 2.8, so you may need to modify ngx_tpl.lua: https://github.com/apache/apisix/pull/4524

If the Prometheus memory usage grows without limit, it will consume all the memory configured. Otherwise, the memory usage will stop growing at a certain level.

We also need to compare the Prometheus metrics before/after the HTTP requests. The Prometheus client is a well-known memory consumer. How many metrics/labels are there in the Prometheus metrics?

You can also use X-Ray to diagnose the memory issue: https://openresty.com.cn/cn/xray/. Note that I am not the developer of X-Ray (it is a commercial product developed by others).

spacewander avatar Sep 18 '21 08:09 spacewander
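To make that check concrete, here is a minimal diagnostic sketch of my own (not an APISIX feature): it logs the utilisation of the prometheus-metrics shared dict once a minute, assuming it is loaded in an init_worker context (for example via a custom plugin or a module hook). capacity() and free_space() are standard OpenResty shared-dict APIs; free_space() needs a reasonably recent OpenResty.

-- Diagnostic sketch (assumption: runs in an init_worker context): periodically
-- log how full the prometheus-metrics shared dict is, to tell "the dict is
-- simply too small" apart from "the usage grows without bound".
local function log_prometheus_dict_usage(premature)
    if premature then
        return
    end
    local dict = ngx.shared["prometheus-metrics"]
    if not dict then
        ngx.log(ngx.WARN, "prometheus-metrics shared dict not found")
        return
    end
    ngx.log(ngx.WARN, "prometheus-metrics dict: capacity=", dict:capacity(),
            " bytes, free=", dict:free_space(), " bytes")
end

local ok, err = ngx.timer.every(60, log_prometheus_dict_usage)
if not ok then
    ngx.log(ngx.ERR, "failed to create prometheus dict timer: ", err)
end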

The configuration of the lua shared dict is not available until 2.8, so you may need to modify ngx_tpl.lua

We are now deploying apisix in the test environment, so I will upgrade apisix to 2.9 and retest next week.

How many metrics/labels are there in the Prometheus metrics?

I counted the number of metric rows in two pods with different memory consumption. Is this magnitude acceptable?

memory 763m: 18869

image

memory 596m: 17924

image

suninuni avatar Sep 18 '21 09:09 suninuni

IMO, this is not normal, but it still depends on the size of the lua shared dict assigned to prometheus in your runtime nginx.conf and the number of concurrent requests. I suggest that you try the X-Ray tool that @spacewander mentioned.

tzssangglass avatar Sep 19 '21 15:09 tzssangglass

@suninuni the latest version of APISIX is 2.11 now; can you give APISIX 2.11 a try? We are not sure whether we have already fixed this issue.

If you still have this problem, please let us know.

membphis avatar Dec 23 '21 09:12 membphis

@suninuni the latest version of APISIX is 2.11 now; can you give APISIX 2.11 a try? We are not sure whether we have already fixed this issue.

If you still have this problem, please let us know.

The same error log as before.

2021/12/30 02:36:43 [error] 50#50: *20414229 [lua] prometheus.lua:860: log_error(): Unexpected error adding a key: no memory while logging request

And now I am trying to set lua_shared_dict.prometheus-metrics to 100m, FYI.

Although not mentioned before, I had also increased the dict size before upgrading and still ran into the memory leak problem, FYI.

suninuni avatar Dec 30 '21 02:12 suninuni

And now I am trying to set lua_shared_dict.prometheus-metrics to 100m

It seems that the memory stops increasing after reaching a certain value, and I no longer see the no memory while logging request error in the logs.

image

So it seems that this problem no longer exists with APISIX 2.11; I will continue to observe. Thanks for your support!

suninuni avatar Dec 30 '21 06:12 suninuni

Unexpected error adding a key

So the memory increase is normal, and the no memory issue is just because the pre-defined lua shared dict was too small?

tokers avatar Dec 30 '21 09:12 tokers

Unexpected error adding a key

So the memory increase is normal, and the no memory issue is just because the pre-defined lua shared dict was too small?

After a period of stability, the memory continued increasing until OOM...

image

But there are no nginx metric errors like Unexpected error adding a key.

suninuni avatar Jan 03 '22 23:01 suninuni

@tokers @membphis FYI

suninuni avatar Jan 05 '22 08:01 suninuni

So it seems that this problem no longer exists with APISIX 2.11; I will continue to observe. Thanks for your support!

did you update your APISIX to 2.11?

membphis avatar Jan 05 '22 10:01 membphis

So it seems that this problem no longer exists with APISIX 2.11; I will continue to observe. Thanks for your support!

did you update your APISIX to 2.11?

yes

image

suninuni avatar Jan 06 '22 01:01 suninuni

image

Some monitoring data which may be helpful.

suninuni avatar Jan 06 '22 02:01 suninuni

Finally, I solved this problem by removing the node label (balancer_ip) for the metrics in exporter.lua.

The root cause is that the key saved in the shared dict contains all labels, such as idx=__ngx_prom__key_35279, key=http_status{code="499",route="xxxx",matched_uri="/*",matched_host="xxxxx ",service="",consumer="",node="10.32.47.129"}. In a k8s cluster, especially one where the deployments are frequently updated, the node information is always changing, so the dict keeps growing.

Maybe we can move the node label into extra_labels and let the user decide whether this label is needed.

suninuni avatar Nov 15 '22 09:11 suninuni
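For illustration, here is a simplified sketch of the mechanism described above (it mimics the key format quoted in that comment, it is not the real nginx-lua-prometheus code): every distinct combination of label values becomes its own key in the shared dict, and old keys are never evicted, so a node value that changes with every rollout keeps adding keys. The route name and IPs below are made up for the example.

-- Simplified illustration only (not the library's actual code): the shared
-- dict key is the full metric name including every label value, so each new
-- upstream pod IP in the node label creates a brand-new key that never goes away.
local function full_metric_key(name, label_names, label_values)
    local parts = {}
    for i, label in ipairs(label_names) do
        parts[i] = label .. '="' .. tostring(label_values[i]) .. '"'
    end
    return name .. "{" .. table.concat(parts, ",") .. "}"
end

-- two requests to the same route, but the upstream pod was rescheduled in between:
print(full_metric_key("http_status", {"code", "route", "node"},
                      {499, "httpbin", "10.32.47.129"}))
print(full_metric_key("http_status", {"code", "route", "node"},
                      {499, "httpbin", "10.32.48.7"}))
-- http_status{code="499",route="httpbin",node="10.32.47.129"}
-- http_status{code="499",route="httpbin",node="10.32.48.7"}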

Finally, I solved this problem by removing the node label (balancer_ip) for the metrics in exporter.lua.

The root cause is that the key saved in the shared dict contains all labels, such as idx=__ngx_prom__key_35279, key=http_status{code="499",route="xxxx",matched_uri="/*",matched_host="xxxxx ",service="",consumer="",node="10.32.47.129"}. In a k8s cluster, especially one where the deployments are frequently updated, the node information is always changing, so the dict keeps growing.

Maybe we can move the node label into extra_labels and let the user decide whether this label is needed.

Indeed, that's a real problem when users deploy Apache APISIX on Kubernetes.

tokers avatar Nov 15 '22 09:11 tokers

@suninuni How did you remove the node label? Are you using a custom build of APISIX?

va3093 avatar Jun 06 '23 09:06 va3093

I fixed this by adding the following module_hook:

local apisix = require("apisix")

-- wrap http_init so the error log confirms that the hook file was loaded
local old_http_init = apisix.http_init
apisix.http_init = function (...)
    ngx.log(ngx.EMERG, "Module hooks loaded")
    return old_http_init(...)
end

-- overwrite balancer_ip before the prometheus exporter logs the request, so the
-- node label always has the same value and the shared dict stops growing with
-- every new upstream pod IP
local exporter = require("apisix.plugins.prometheus.exporter")
local old_http_log = exporter.http_log
exporter.http_log = function (conf, ctx)
    ctx.balancer_ip = "_overwritten_"
    return old_http_log(conf, ctx)
end

I put that in a ConfigMap and loaded it into the APISIX config using the following settings in my Helm chart values.yml file.

  luaModuleHook:
    enabled: true
    luaPath: "/usr/local/apisix/apisix/module_hooks/?.lua"
    hookPoint: "module_hook"
    configMapRef:
      name: "apisix-module-hooks"
      mounts:
        - key: "module_hook.lua"
          path: "/usr/local/apisix/apisix/module_hooks/module_hook.lua"

va3093 avatar Jun 13 '23 08:06 va3093

I observed that it is actually the apisix_http_latency bucket series that keeps increasing, which causes the memory to keep rising.

image

apisix_http_status is not particularly abundant.

image

There are also many errors in the apisix_nginx_metric_errors_total indicator, and the log reports errors:

content:2023/07/06 15:41:08 [error] 48#48: *972511102 [lua] init.lua:187: http_ssl_phase(): failed to fetch ssl config: failed to find SNI: please check if the client requests via IP or uses an outdated protocol. If you need to report an issue, provide a packet capture file of the TLS handshake., context: ssl_certificate_by_lua*, client: 109.237.98.226, server: 0.0.0.0:9443

image

How should I optimize and solve this problem?

susugo avatar Jul 07 '23 03:07 susugo
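Before optimizing, it may help to confirm which metric family actually owns most of the shared-dict keys. Here is a debugging sketch of my own (assumptions: it runs inside an APISIX worker, for example from a temporary serverless function or module hook, and the keys follow the name{labels} format quoted earlier in this thread). Note that get_keys(0) returns every key and can briefly block the worker, so it is for one-off debugging only.

-- One-off debugging sketch: count prometheus shared-dict keys per metric
-- family, to check whether the latency histogram buckets really dominate.
local function count_keys_per_metric()
    local dict = ngx.shared["prometheus-metrics"]
    if not dict then
        return nil, "prometheus-metrics shared dict not found"
    end

    local counts = {}
    for _, key in ipairs(dict:get_keys(0)) do   -- 0 = return all keys
        -- full metric keys look like name{label="value",...}; this skips the
        -- library's internal __ngx_prom__key_N index entries
        local name = key:match("^([^{]+){")
        if name then
            counts[name] = (counts[name] or 0) + 1
        end
    end
    return counts
end

local counts, err = count_keys_per_metric()
if not counts then
    ngx.log(ngx.ERR, err)
else
    for name, n in pairs(counts) do
        ngx.log(ngx.WARN, "metric family ", name, ": ", n, " keys")
    end
end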

This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jun 21 '24 10:06 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar Jul 05 '24 10:07 github-actions[bot]