Memory leak with distributed tracing enabled
What happened?
Adding --experimental-enable-distributed-tracing works, but it causes a memory leak of about 1 GB per hour in our setup. Instead of the expected ~2 GB, memory usage reached about 12 GB in 7 hours.
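As a rough sanity check of the figures above, the average growth rate can be computed from the two readings. This is only a sketch: the function name is hypothetical, and the inputs are the approximate values quoted in the report (~2 GB baseline, ~12 GB after 7 hours), not measured samples.

```python
def growth_rate_gb_per_hour(start_gb: float, end_gb: float, hours: float) -> float:
    """Average growth in GB per hour between two RSS readings."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_gb - start_gb) / hours


# Approximate figures from the report: ~2 GB baseline, ~12 GB after 7 hours.
rate = growth_rate_gb_per_hour(2.0, 12.0, 7.0)
print(f"average growth: {rate:.2f} GB/hour")
```

With these rounded inputs the average comes out somewhat above 1 GB per hour, which is the same order of magnitude as the rate stated in the report.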
What did you expect to happen?
Stable memory usage around 1.8-2.0 GB of RSS.
How can we reproduce it (as minimally and precisely as possible)?
Run with --experimental-enable-distributed-tracing for a few hours. It is sufficient to enable it on one member.
Anything else we need to know?
The tracing collector endpoint doesn't need to be configured or listening. Having otelcol on 4317 doesn't change anything (beyond actually making tracing work).
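To record whether a collector was actually listening during a reproduction run, a plain TCP connection test is enough. The sketch below is an illustration only: the helper name is hypothetical, and `localhost` is an assumption (the report only mentions port 4317).

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, and similar errors.
        return False


if __name__ == "__main__":
    # 4317 is the port mentioned in the report; "localhost" is an assumption.
    print("collector reachable:", is_listening("localhost", 4317))
```

This only checks that something accepts TCP connections on the port; it does not verify that the listener speaks OTLP.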
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.0
Git SHA: f99cada05
Go Version: go1.16.6
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5
Etcd configuration (command line flags or environment variables)
etcd, Kubernetes, OKD / Openshift 4.9, 3 members.
etcd --experimental-enable-distributed-tracing --logger=zap --log-level=info --initial-advertise-peer-urls=https://10.10.0.102:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com..crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.10.0.102:2379 --listen-client-urls=https://0.0.0.0:2379,unixs://10.10.0.102:0 --listen-peer-urls=https://0.0.0.0:2380 --metrics=extensive --listen-metrics-urls=https://0.0.0.0:9978
running in cri-o
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
[root@master-1 /]# etcdctl member list -w table
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
| 10f8cf6269xxx | started | master-2.example.com | https://10.10.0.103:2380 | https://10.10.0.103:2379 | false |
| a2bbe7149xxx | started | master-1.example.com | https://10.10.0.102:2380 | https://10.10.0.102:2379 | false |
| acb2c160xxx | started | master-0.example.com | https://10.10.0.101:2380 | https://10.10.0.101:2379 | false |
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
Relevant log output
No fatal issues in the logs.
Memory usage (tracing was enabled on master-1 around 1:00 PM CEST, in the middle of the timeline on the graph):

pprof heap snapshots, without and with tracing enabled, taken every hour for 7 hours (starting just after etcd startup):
etcd_pprof_issue_13990.tar.gz
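For anyone trying to reproduce this, periodic heap snapshots like the ones attached could be collected with a small script. This is a minimal sketch, not the method used for the attached archive: it assumes the member was started with etcd's `--enable-pprof` flag so that `/debug/pprof/heap` is served on a client URL, and the function names, file naming scheme, and example URL are hypothetical. The plain `urllib` fetch does not handle the TLS client authentication used in the setup above.

```python
import time
import urllib.request
from pathlib import Path
from typing import Callable, List


def fetch_heap_profile(url: str) -> bytes:
    """Download one heap profile over HTTP (no TLS client auth handled here)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def capture_profiles(
    fetch: Callable[[], bytes],
    count: int,
    interval_s: float,
    out_dir: Path,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Save `count` profiles to out_dir, waiting interval_s between captures."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out_dir / f"heap-{i:02d}.pb.gz"
        path.write_bytes(fetch())
        paths.append(path)
        if i < count - 1:  # no need to wait after the last capture
            sleep(interval_s)
    return paths


# Example (hypothetical URL; requires pprof to be enabled on the member):
# capture_profiles(
#     lambda: fetch_heap_profile("http://127.0.0.1:2379/debug/pprof/heap"),
#     count=8, interval_s=3600, out_dir=Path("profiles"))
```

Passing the fetch function in as a parameter keeps the capture loop independent of how the profile is obtained, so a TLS-aware fetcher can be swapped in without changing the loop.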
$ etcd --version
etcd Version: 3.5.0
Git SHA: f99cada05
I can't find the Git SHA you provided in the etcd-io/etcd git log. Does this etcd build contain your custom commits?
No custom commits by me. This is the etcd distributed as part of OKD v1.22.1-1839. I am not familiar enough with their build to know what changes (if any) were made.
It looks like OpenShift customized etcd. @hexfusion, could you confirm this?
I just downloaded the official 3.5.0, and did a quick verification below.
$ ./etcd --version
etcd Version: 3.5.0
Git SHA: 946a5a6f2
Go Version: go1.16.3
Go OS/Arch: linux/amd64
$ git log --pretty=oneline | grep 946a5a6f2
946a5a6f25c3b6b89408ab447852731bde6e6289 version: 3.5.0
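The manual check above (compare the short Git SHA printed by `etcd --version` against a commit in the upstream log) can be sketched as a small script. The function names are hypothetical; the sample inputs are copied from the outputs in this thread.

```python
def parse_version_output(text: str) -> dict:
    """Parse 'Key: Value' lines from `etcd --version` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def sha_matches(short_sha: str, full_sha: str) -> bool:
    """True if the abbreviated SHA is a prefix of the full commit SHA."""
    return bool(short_sha) and full_sha.startswith(short_sha)


upstream = parse_version_output(
    "etcd Version: 3.5.0\n"
    "Git SHA: 946a5a6f2\n"
    "Go Version: go1.16.3\n"
    "Go OS/Arch: linux/amd64\n"
)
release_commit = "946a5a6f25c3b6b89408ab447852731bde6e6289"

print(sha_matches(upstream["Git SHA"], release_commit))  # upstream binary
print(sha_matches("f99cada05", release_commit))          # SHA from the report
```

A prefix match only tells you whether the binary reports the same commit as the upstream release tag; a mismatch does not by itself say how much the build differs.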
cc @lilic
I can confirm it's not an upstream binary; this is the downstream repo the build comes from [1]. The changes to etcd itself would be minimal. 3.5.0 uses a pretty old version of otel (pre v1), so it's possible that the bug was in otel as well.
[1] https://github.com/openshift/etcd
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This is required to graduate distributed tracing.
We have already bumped otel to 1.0.1 in https://github.com/etcd-io/etcd/pull/14312. @baryluk, could you please double check whether you can still see this issue? Thanks.
Sure, I can try on Monday to test it.
ping @baryluk
Hey @baryluk, can you confirm that the issue was addressed? You closed the issue as "not planned", so I wanted to double check.
cc @dashpole
@serathius I was not able to reproduce the issue.
Great, closing issue as fixed. Thanks for looking into this.