Suspected Envoy memory leak
Title: Suspected Envoy memory leak
Description: We are using Envoy in our network as an edge proxy for accessing backend services. All services proxied by Envoy are available to other internal services, and some of these services also need to be exposed externally, to clients outside of our network. We therefore run two instances of Envoy, "external" and "internal". The configuration for each of these instances is provided by a corresponding "external" or "internal" instance of an in-house xDS service, which uses Consul for discovery of the backend nodes. The xDS service binary is the same for "external" and "internal", i.e. it uses the same logic to build the Envoy configuration; the only difference is the number of services included in the configuration ("external" vs. "internal").
We monitor the running Envoy instances and the metrics are processed by Prometheus. We noticed a difference in the memory utilization graphs; below are graphs showing memory utilization metrics for a period of 7 days:
- External: [memory utilization graph, 7 days]
- Internal: [memory utilization graph, 7 days]
As you can see, the memory utilization graphs for the "internal" instances show signs of a memory leak, while the graphs for the "external" instances do not.
Do you know of any issues which could possibly lead to such memory leaks?
Envoy Version: We currently use Envoy v1.19.1, compiled in-house with FIPS enabled. We previously used Envoy v1.17.0 and observed the same behavior.
Repro steps: We cannot provide the configuration required to reproduce this; it is built dynamically by our xDS services and contains too many sensitive details to be redacted manually.
Admin and Stats Output: No data, see above
Config: No data, see above
Logs: No logs
You will need to acquire a heap profile so we can see what is going on. I can't remember when we switched to the new tcmalloc by default, but if that version is still using gperftools you might be able to get it out of the box. Otherwise you will need to recompile with gperftools until we fix https://github.com/envoyproxy/envoy/issues/16506.
cc @wbpcode @jmarantz
@mattklein123 Thanks for your advice. It will take us some time to run, deploy and collect the data, so please keep this issue open until then.
Hi,
is it possible for you to try Envoy 1.15.5?
Not sure if it's related, but in our infrastructure 1.15.x works fine; from that point on, Envoy keeps getting OOM-killed too.
I've found that building Envoy with --define tcmalloc=gperftools fixes the issue for us, although I don't know why the other tcmalloc implementation doesn't work as expected.
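For reference, a bare Bazel build with that define would look roughly like this (a sketch: the target and optimization flag are the usual ones and may need adjusting for your setup):
# build a release binary using the gperftools tcmalloc instead of the default allocator
bazel build -c opt //source/exe:envoy-static --define tcmalloc=gperftools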
@TomasKohout thanks for the suggestion, I will need to consult my team about this.
We are currently working on building one of the more recent versions of Envoy with memory profiling enabled.
@mattklein123 Hi Matt,
We have custom-built Envoy v1.21.2 with tcmalloc=gperftools enabled.
Do you have any specific requirements / procedure on how the heap profile should be taken?
Take a look at the instructions here: https://gperftools.github.io/gperftools/heapprofile.html
You should be able to generate graph view plots of heap usage over time which should tell us what is going on. Thank you!
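Following those docs, collecting and rendering the profiles could look roughly like this (a sketch: the config path, dump prefix, and profile numbers are placeholders, and it assumes the binary is linked against the gperftools tcmalloc):
# start Envoy with the gperftools heap profiler enabled; dumps are written as <prefix>.NNNN.heap
HEAPPROFILE=/tmp/envoy.hprof /usr/local/bin/envoy -c /etc/envoy/envoy.yaml
# render a graph of in-use heap for one dump, or diff two dumps with --base to see growth over time
pprof --pdf /usr/local/bin/envoy /tmp/envoy.hprof.0010.heap > heap.pdf
pprof --pdf --base=/tmp/envoy.hprof.0001.heap /usr/local/bin/envoy /tmp/envoy.hprof.0010.heap > heap-growth.pdf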
@mattklein123
We ran some tests using the custom-built Envoy v1.21.2 with --define tcmalloc=gperftools. We also installed the pprof binary into the same container as Envoy. We can see the heap profile files being generated; however, when we run pprof we don't see many details, for example:
# pprof -top /usr/local/bin/envoy envoy.hprof.0003.heap
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: envoy
Type: inuse_space
Showing nodes accounting for 6012.23kB, 99.87% of 6019.83kB total
Dropped 1 node (cum <= 30.10kB)
flat flat% sum% cum cum%
6010.69kB 99.85% 99.85% 6011.53kB 99.86% [envoy]
1.55kB 0.026% 99.87% 4616.95kB 76.70% [libc.so.6]
0 0% 99.87% 3279.10kB 54.47% <unknown>
Are we missing some steps with our setup?
Are you using an Envoy with debug symbols? How did you compile it?
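A quick way to check whether the binary being profiled still carries symbols (the path is just illustrative):
# "not stripped" in the output means the symbol table is still present
file /usr/local/bin/envoy
# a stripped binary will report "no symbols" here
nm -C /usr/local/bin/envoy | head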
We needed Envoy to use FIPS-enabled SSL, so we updated ./ci/do_ci.sh and added:
...
exit 0
elif [[ "$CI_TARGET" == "bazel.release.fips" ]]; then
# to build this via docker run `./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release.fips'`
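# boringssl=fips selects the FIPS-compliant BoringSSL; tcmalloc=gperftools switches the allocator to gperftools tcmalloc so its heap profiler can be used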
BAZEL_BUILD_OPTIONS+=("--define" "boringssl=fips" "--define" "tcmalloc=gperftools" "--define" "exported_symbols=enabled")
setup_clang_toolchain
echo "bazel release fips enabled build..."
bazel_envoy_binary_build release
exit 0
elif [[ "$CI_TARGET" == "bazel.release.server_only" ]]; then
....
and then we ran
./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release.fips'
and then built a Docker image using ./ci/Dockerfile-envoy-alpine.
That should have symbols. I don't know off the top of my head unfortunately. You will need to debug.
./ci/Dockerfile-envoy-alpine has these lines:
ARG ENVOY_BINARY_SUFFIX=_stripped
ADD linux/amd64/build_envoy_release${ENVOY_BINARY_SUFFIX}/* /usr/local/bin/
Can this cause the final image to contain a stripped binary? If yes, how can we change it?
Fix the dockerfile or just don't use alpine, use the debug image.
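If you stay with the alpine Dockerfile, one option (a sketch, assuming the CI build also produced the unstripped build_envoy_release output next to the stripped one) is to override the build arg so the unstripped binary is copied into the image:
# run from the directory containing linux/amd64/build_envoy_release*; an empty ENVOY_BINARY_SUFFIX makes the Dockerfile ADD the unstripped binary
docker build -f /path/to/envoy/ci/Dockerfile-envoy-alpine --build-arg ENVOY_BINARY_SUFFIX= -t envoy-fips-profiling:local .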
FYI, we have now verified that we can collect heap profiles with our custom Envoy v1.21.2. We are going to run it in our environment for a few days and collect heap profiles. We'll post the results shortly.
We noticed that enabling the heap profiler via POST /heapprofiler?enabled=y causes significant delays on the Prometheus stats endpoint. With the heap profiler enabled, GET /stats/prometheus takes ~8-9 seconds; without it, it takes just ~100 milliseconds.
Is that expected?
Seems plausible. Prometheus stats allocate a very large amount of temporary storage, so it's not surprising that it stresses the memory profiler.
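For anyone who wants to reproduce the comparison, a rough check could look like this (assuming the admin interface listens on 127.0.0.1:9901):
# time the Prometheus stats scrape once with the heap profiler disabled and once with it enabled
time curl -s -o /dev/null http://127.0.0.1:9901/stats/prometheus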
@jmarantz thanks for clarification!
After upgrading from v1.19.1 to v1.21.2 (both versions custom-built with FIPS enabled) we are seeing a difference in physical memory usage:
Were you triggering the overload manager?
We have a configuration in place to trigger:
- envoy.overload_actions.shrink_heap when 95% of max memory is reached
- envoy.overload_actions.stop_accepting_requests when 98% of max memory is reached
These recommendations were taken from the https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge guidance.
We saw stop_accepting_requests triggered once, on version v1.19.1. At the time we didn't have an alert in place to monitor Envoy's memory utilization, and one of our instances managed to reach its maximum of 2 GB of allocated memory. Once stop_accepting_requests was triggered, we were alerted by an increase in 503 Service Unavailable errors from the affected instance. We analyzed the cause, and this is how we became aware of the leak. envoy.overload_actions.shrink_heap was not triggered at that time, or maybe it did trigger but was not able to release any memory.
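One way to see whether the overload actions actually fired, and how Envoy itself accounts for its heap, is to watch the admin stats (a sketch; 9901 is an assumed admin port and exact stat names can vary between versions):
# overload action state plus Envoy's own heap/physical memory gauges
curl -s http://127.0.0.1:9901/stats | grep -E 'overload\.|server\.memory_'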
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Hi @aqua777, we observed a similar pattern in our service. Did you ever reach a conclusion on this issue? Thanks!
Hi @haoruolei , see my post here: https://github.com/envoyproxy/envoy/issues/21092#issuecomment-1180536708
After the upgrade we stopped seeing the memory leak, so we didn't do any further investigation.