Suspected Envoy memory leak
Title: Suspected Envoy memory leak
Description: We are using Envoy in our network as an edge proxy for accessing backend services. All services proxied by Envoy are available to other internal services, and some of these services also need to be exposed externally, to clients outside of our network. We therefore run two instances of Envoy, "external" and "internal". The configuration for each of these instances is provided by a corresponding "external" or "internal" instance of an in-house xDS service, which uses Consul for discovery of the backend nodes. The xDS service binary is the same for "external" and "internal", i.e. it uses the same logic to build the Envoy configuration; the only difference is the number of services included in the configuration ("external" vs. "internal").
We monitor the running Envoy instances and the metrics are processed by Prometheus. We noticed a difference in the memory utilization graphs; below are graphs showing memory utilization metrics for a period of 7 days:
- External: [memory utilization graph, 7 days]
- Internal: [memory utilization graph, 7 days]
As you can see, the memory utilization graphs for the "internal" instances show signs of a memory leak, while the graphs for the "external" instances do not.
Do you know of any issues which could possibly lead to such memory leaks?
Envoy Version: We currently use Envoy v1.19.1, compiled in-house with FIPS enabled. We previously used Envoy v1.17.0 and observed the same behavior.
Repro steps: We cannot provide the configuration required to reproduce this; it is built dynamically by our xDS services and contains too many sensitive details to be redacted manually.
Admin and Stats Output: No data, see above
Config: No data, see above
Logs: No logs
You will need to acquire a heap profile so we can see what is going on. I can't remember when we switched to the new tcmalloc by default, but if that version is still using gperftools you might be able to get it out of the box. Otherwise you will need to recompile with gperftools until we fix https://github.com/envoyproxy/envoy/issues/16506.
cc @wbpcode @jmarantz
@mattklein123 Thanks for your advice. It will take us some time to run, deploy and collect the data, so please keep this issue open until then.
Hi,
is it possible for you to try Envoy 1.15.5?
Not sure if it's related, but in our infrastructure 1.15.x works fine; from that point on, Envoy keeps getting OOM-killed too.
I've found that building Envoy with --define tcmalloc=gperftools fixes the issue for us, although I don't know why the other tcmalloc implementation doesn't work as expected.
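For reference, a bare Bazel build with that define would look roughly like this (a sketch: the target and optimization flag are the usual ones and may need adjusting for your setup):
# build a release binary using the gperftools tcmalloc instead of the default allocator
bazel build -c opt //source/exe:envoy-static --define tcmalloc=gperftools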
@TomasKohout thanks for the suggestion, I will need to consult my team about this.
We are currently working on building one of the more recent versions of Envoy with memory profiling enabled.
@mattklein123 Hi Matt,
We have custom-built Envoy v1.21.2 with tcmalloc=gperftools enabled.
Do you have any specific requirements / procedure on how the heap profile should be taken?
Take a look at the instructions here: https://gperftools.github.io/gperftools/heapprofile.html
You should be able to generate graph view plots of heap usage over time which should tell us what is going on. Thank you!
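Following those docs, collecting and rendering the profiles could look roughly like this (a sketch: the config path, dump prefix, and profile numbers are placeholders, and it assumes the binary is linked against the gperftools tcmalloc):
# start Envoy with the gperftools heap profiler enabled; dumps are written as <prefix>.NNNN.heap
HEAPPROFILE=/tmp/envoy.hprof /usr/local/bin/envoy -c /etc/envoy/envoy.yaml
# render a graph of in-use heap for one dump, or diff two dumps with --base to see growth over time
pprof --pdf /usr/local/bin/envoy /tmp/envoy.hprof.0010.heap > heap.pdf
pprof --pdf --base=/tmp/envoy.hprof.0001.heap /usr/local/bin/envoy /tmp/envoy.hprof.0010.heap > heap-growth.pdf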
@mattklein123
We ran some tests using the custom-built Envoy v1.21.2 with --define tcmalloc=gperftools. We also installed the pprof binary into the same container as Envoy. We can see the heap profile files being generated; however, when we run pprof we don't see many details, for example:
# pprof -top /usr/local/bin/envoy envoy.hprof.0003.heap
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: envoy
Type: inuse_space
Showing nodes accounting for 6012.23kB, 99.87% of 6019.83kB total
Dropped 1 node (cum <= 30.10kB)
flat flat% sum% cum cum%
6010.69kB 99.85% 99.85% 6011.53kB 99.86% [envoy]
1.55kB 0.026% 99.87% 4616.95kB 76.70% [libc.so.6]
0 0% 99.87% 3279.10kB 54.47% <unknown>
Are we missing some steps with our setup?
Are you using an Envoy with debug symbols? How did you compile it?
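A quick way to check whether the binary being profiled still carries symbols (the path is just illustrative):
# "not stripped" in the output means the symbol table is still present
file /usr/local/bin/envoy
# a stripped binary will report "no symbols" here
nm -C /usr/local/bin/envoy | head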
We needed Envoy to use FIPS-enabled SSL, so we updated ./ci/do_ci.sh and added:
...
exit 0
elif [[ "$CI_TARGET" == "bazel.release.fips" ]]; then
# to build this via docker run `./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release.fips'`
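# boringssl=fips selects the FIPS-compliant BoringSSL; tcmalloc=gperftools switches the allocator to gperftools tcmalloc so its heap profiler can be used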
BAZEL_BUILD_OPTIONS+=("--define" "boringssl=fips" "--define" "tcmalloc=gperftools" "--define" "exported_symbols=enabled")
setup_clang_toolchain
echo "bazel release fips enabled build..."
bazel_envoy_binary_build release
exit 0
elif [[ "$CI_TARGET" == "bazel.release.server_only" ]]; then
....
and then we ran
./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release.fips'
and then built a Docker image using ./ci/Dockerfile-envoy-alpine.
That should have symbols. I don't know off the top of my head unfortunately. You will need to debug.
./ci/Dockerfile-envoy-alpine has these lines:
ARG ENVOY_BINARY_SUFFIX=_stripped
ADD linux/amd64/build_envoy_release${ENVOY_BINARY_SUFFIX}/* /usr/local/bin/
Can this cause the final image to contain a stripped binary? If yes, how can we change it?
Fix the dockerfile or just don't use alpine, use the debug image.
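If you stay with the alpine Dockerfile, one option (a sketch, assuming the CI build also produced the unstripped build_envoy_release output next to the stripped one) is to override the build arg so the unstripped binary is copied into the image:
# run from the directory containing linux/amd64/build_envoy_release*; an empty ENVOY_BINARY_SUFFIX makes the Dockerfile ADD the unstripped binary
docker build -f /path/to/envoy/ci/Dockerfile-envoy-alpine --build-arg ENVOY_BINARY_SUFFIX= -t envoy-fips-profiling:local .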
FYI, we have now verified that we can collect heap profiles with our custom Envoy v1.21.2. We are going to run it in our environment for a few days and collect heap profiles. We'll post the results shortly.
We noticed that enabling the heap profiler via POST /heapprofiler?enabled=y causes significant delays on the Prometheus stats endpoint. With the heap profiler enabled, GET /stats/prometheus takes ~8-9 seconds; without it, it takes just ~100 milliseconds.
Is that expected?
Seems plausible. Prometheus stats allocate a very large amount of temporary storage, so it's not surprising that it stresses the memory profiler.
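For anyone who wants to reproduce the comparison, a rough check could look like this (assuming the admin interface listens on 127.0.0.1:9901):
# time the Prometheus stats scrape once with the heap profiler disabled and once with it enabled
time curl -s -o /dev/null http://127.0.0.1:9901/stats/prometheus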
@jmarantz thanks for clarification!
After upgrading from v1.19.1 to v1.21.2 (both versions custom-built with FIPS enabled) we are seeing a difference in physical memory usage:
Were you triggering the overload manager?
We have a configuration in place to trigger:
- envoy.overload_actions.shrink_heap when 95% of max memory is reached
- envoy.overload_actions.stop_accepting_requests when 98% of max memory is reached
These recommendations were taken from the https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge guidance.
We saw stop_accepting_requests triggered once, on version v1.19.1. At the time we didn't have an alert in place to monitor Envoy's memory utilization, and one of our instances managed to reach its maximum of 2 GB of allocated memory. Once stop_accepting_requests was triggered, we were alerted by an increase in 503 Service Unavailable errors from the affected instance. We analyzed the cause, and this is how we became aware of the leak. envoy.overload_actions.shrink_heap was not triggered at that time, or maybe it did trigger but was not able to release any memory.
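One way to see whether the overload actions actually fired, and how Envoy itself accounts for its heap, is to watch the admin stats (a sketch; 9901 is an assumed admin port and exact stat names can vary between versions):
# overload action state plus Envoy's own heap/physical memory gauges
curl -s http://127.0.0.1:9901/stats | grep -E 'overload\.|server\.memory_'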
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Hi @aqua777, we observed a similar pattern in our service. Did you ever reach a conclusion on this issue? Thanks!
Hi @haoruolei , see my post here: https://github.com/envoyproxy/envoy/issues/21092#issuecomment-1180536708
After the upgrade we stopped seeing the memory leak, so we didn't do any further investigation.