
memory heap not released long after overload manager actions triggered

Open caoyukun0430 opened this issue 3 years ago • 38 comments

Title: memory heap not released long (~24h) after overload manager actions stop_accepting_connections/requests triggered

Description: We have recently been testing the overload manager with Istio and have questions about its memory release behavior. Our scenario is to push the ingressgateway over the configured heap_size by sending requests over a large number of concurrent connections; we then see the stop_accepting_connections action trigger and new requests fail. The unexpected issues are:

  1. The "heap_size" is not released after almost 24h, even though we have shrink_heap configured and it has already triggered, and we do see pageheap_unmapped_bytes increase and allocated drop. As a result the ingressgateway pod hangs and can't recover from the overload situation to take new requests. The EXPECTED behavior is that "heap_size" should drop (gradually). We also want to ask: what is the memory-releasing / heap-shrinking behavior? Which parameters will drop, and why does heap_size stay high forever?

  2. This point is more of a question: the ingressgateway pod is 0/1 Running because we configured a readiness probe, so stop_accepting_connections (and stop_accepting_requests) also blocks the container's internal probes. Is it reasonable to exclude the probes from the overload actions, or is it intentional that all connections/requests are rejected, including the probes?

Repro steps: Deploy Istio 1.13 with the overload manager configured as maxHeapSize: 30000000, heapShrinkPercent: 0.5, heapStopAcceptConnPercent: 0.7. Deploy a fortio client and an httpbin server, and make HTTP2 requests via the ingressgateway over 300 concurrent connections with kubectl exec "$FORTIO_POD" -c fortioclient -n istio-system -- /usr/bin/fortio load -c 300 -qps 1000 -t 30s -stdclient -loglevel Warning http://10.106.87.64/get so that the "heap_size" exceeds the threshold for envoy.overload_actions.stop_accepting_connections. After that, new connections/requests fail to be established.

k get pods
NAME                                    READY   STATUS    RESTARTS   AGE
istiod-74dc6d5578-chbtp                 1/1     Running   0          89s
fortioclient-5db8c6576c-mbjnz           3/3     Running   0          6d2h
httpbin-78f4fd5455-8r9lb                2/2     Running   0          6d2h
httpbin-78f4fd5455-lqsvh                2/2     Running   0          6d2h
istio-egressgateway-77bbd94c4c-wmt77    1/1     Running   0          89s
istio-ingressgateway-6b5699fb7d-w5dn2   1/1     Running   0          89s

kubectl exec -ti pod/istio-ingressgateway-6b5699fb7d-lv89q -n istio-system -- bash
bash-4.4$ ps -auxwwf
201507      26  0.3  0.2 212528 48268 ?        Sl   11:17   0:00  |   \_ /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --local-address-ip-version v4 --file-flush-interval-msec 1000 --disable-hot-restart --log-format %Y-%m-%dT%T.%fZ.%l.envoy %n.%v -l info --component-log-level misc:error --config-yaml {   "overload_manager": {     "refresh_interval": {       "seconds": 0,       "nanos": 250000000,     },     "resource_monitors": [       {         "name": "envoy.resource_monitors.fixed_heap",         "typed_config": {           "@type": "type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig",           "max_heap_size_bytes": 30000000         }       }     ],     "actions": [       {         "name": "envoy.overload_actions.shrink_heap",         "triggers": [           {             "name": "envoy.resource_monitors.fixed_heap",             "threshold": {               "value": 0.5             }           }         ]       },       {         "name": "envoy.overload_actions.stop_accepting_requests",         "triggers": [           {             "name": "envoy.resource_monitors.fixed_heap",             "threshold": {               "value": 1             }           }         ]       },       {         "name": "envoy.overload_actions.stop_accepting_connections",         "triggers": [           {             "name": "envoy.resource_monitors.fixed_heap",             "threshold": {               "value": 0.7             }           }         ]       }     ]   }, }

Admin and Stats Output:

// initial heap_size
[15:46]$ kubectl exec -ti pod/istio-ingressgateway-6b5699fb7d-dm5fc -n istio-system -- curl localhost:15000/memory
{
 "allocated": "8419232",
 "heap_size": "10485760",
 "pageheap_unmapped": "0",
 "pageheap_free": "827392",
 "total_thread_cache": "753040",
 "total_physical_bytes": "13107200"
}
// heap_size increases as we are sending requests
[15:47]$ kubectl exec -ti pod/istio-ingressgateway-6b5699fb7d-dm5fc -n istio-system -- curl localhost:15000/memory
{
 "allocated": "14097408",
 "heap_size": "19922944",
 "pageheap_unmapped": "761856",
 "pageheap_free": "0",
 "total_thread_cache": "4378920",
 "total_physical_bytes": "21782528"
}
// stop accepting conn triggered
[15:49]$ kubectl exec -ti pod/istio-ingressgateway-6b5699fb7d-dm5fc -n istio-system -- curl localhost:15000/memory
{
 "allocated": "14426192",
 "heap_size": "22020096",
 "pageheap_unmapped": "958464",
 "pageheap_free": "0",
 "total_thread_cache": "5730432",
 "total_physical_bytes": "23683072"
}

// after one day
[09:02]$ kubectl exec -ti pod/istio-ingressgateway-6b5699fb7d-dm5fc -n istio-system -- curl localhost:15000/memory
{
 "allocated": "8407824",
 "heap_size": "22020096",
 "pageheap_unmapped": "16384",
 "pageheap_free": "0",
 "total_thread_cache": "5886176",
 "total_physical_bytes": "24625152"
}
$ k get pods
NAME                                    READY   STATUS    RESTARTS   AGE
istiod-74dc6d5578-xtbrj			1/1     Running   0          17h
fortioclient-5db8c6576c-mbjnz           3/3     Running   0          22h
httpbin-78f4fd5455-8r9lb                2/2     Running   0          22h
httpbin-78f4fd5455-lqsvh                2/2     Running   0          22h
istio-ingressgateway-6b5699fb7d-dm5fc   0/1     Running   0          17h
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  102s (x28157 over 17h)  kubelet  Readiness probe failed: Get "http://172.17.0.5:15021/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Config:

Include the config used to configure Envoy.

Logs: server_info.log

Call Stack:

If the Envoy binary is crashing, a call stack is required. Please refer to the Bazel Stack trace documentation.

caoyukun0430 avatar Jun 28 '22 11:06 caoyukun0430

@kbaichoo Hi, as suggested by @htuch in a conversation on the Slack channel, it would be great if you could take a look at this issue, thanks!

caoyukun0430 avatar Jun 28 '22 12:06 caoyukun0430

Friendly ping @KBaichoo

wbpcode avatar Jun 29 '22 13:06 wbpcode

And could you check your shrink_count to ensure the shrink action is triggered. Thanks.

wbpcode avatar Jun 29 '22 13:06 wbpcode

max Heap Size is: 30MB heapShrinkPercent: 0.5 heapStopAcceptConnPercent: 0.7

Your application heap size ranges from 10MB - 22MB from the data provided, and for most of the provided samples page heap free is 0.

As this is Istio, we're using gperftools, as such we can see that our call to https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L287 will cause https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L216 pageheap_unmapped to increase, which we see.

From description of: tcmalloc.pageheap_unmapped_bytes: They can be used to fulfill allocation requests, but typically incur a page fault. They always count towards virtual memory usage, and depending on the OS, typically do not count towards physical memory usage.

I think the resource utilization (fixed heap) logic would end up discounting the unmapped pages, so we'd typically be getting ~21/22MB heap size physical pages with 30MB being the limit. 21/30 = 0.7 => we'd be right on the cusp of StopAcceptingConnections, and there's effectively no free memory, so there's nothing for TCMalloc to release to the system -- note that pageheap free is 0 in your examples after stop accepting connections.

I think you might tune your fixed heap, overload manager actions by figuring out how much memory your expected workload consumes.

KBaichoo avatar Jun 29 '22 20:06 KBaichoo

And could you check your shrink_count to ensure the shrink action is triggered. Thanks.

Thanks for the hint! But where can I check the shrink_count parameter? Is it exposed as a stat?

caoyukun0430 avatar Jul 01 '22 07:07 caoyukun0430

max Heap Size is: 30MB heapShrinkPercent: 0.5 heapStopAcceptConnPercent: 0.7

Your application heap size ranges from 10MB - 22MB from the data provided, and for most of the provided samples page heap free is 0.

As this is Istio, we're using gperftools, as such we can see that our call to https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L287 will cause https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L216 pageheap_unmapped to increase, which we see.

From description of: tcmalloc.pageheap_unmapped_bytes: They can be used to fulfill allocation requests, but typically incur a page fault. They always count towards virtual memory usage, and depending on the OS, typically do not count towards physical memory usage.

I think the resource utilization (fixed heap) logic would end up discounting the unmapped pages, so we'd typically be getting ~21/22MB heap size physical pages with 30MB being the limit. 21/30 = 0.7 => we'd be right on the cusp of StopAcceptingConnections, and there's effectively no free memory, so there's nothing for TCMalloc to release to the system -- note that pageheap free is 0 in your examples after stop accepting connections.

I think you might tune your fixed heap, overload manager actions by figuring out how much memory your expected workload consumes.

Hi @KBaichoo, thank you very much for your explanation of pageheap_unmapped and pageheap_free. So if I understand correctly, the pageheap_free bytes are what the shrink_heap action (or the ReleaseFreeMemory() function) can release.

But this still doesn't solve our issue: "heap_size" increased due to traffic, and now that the traffic has been over for ~24h, what is the reason to still keep "heap_size" high? Or, what is inside heap_size that keeps it high? Wouldn't it be more logical for the proxy to return to a state close to the one before the traffic started, i.e. a much smaller "heap_size"?

caoyukun0430 avatar Jul 01 '22 12:07 caoyukun0430

There are structures that get created from traffic, e.g. stats, that did not exist in the initial state; as such we'd never go back to the initial heap size. I would recommend taking a heap profile by following the steps in this thread: https://github.com/envoyproxy/envoy/issues/21092#issuecomment-1147578405 to see exactly where the 10MB that remains when there is no traffic is.

KBaichoo avatar Jul 01 '22 14:07 KBaichoo

There are structures that get created from traffic, e.g. stats, that did not exist in the initial state; as such we'd never go back to the initial heap size. I would recommend taking a heap profile by following the steps in this thread: #21092 (comment) to see exactly where the 10MB that remains when there is no traffic is.

Sorry for the late reply, it took some time for us to gather the data. I built Envoy with "-c dbg" and gathered heap profiles while generating traffic until the ingressgateway was overloaded.

So several questions are still in my head:

  1. During the traffic we saw "pageheap_free" drop from some bytes to zero; why does the heap_size remain the same?
 [14:02][home/ecaoyuk/servicemesh]$ k exec -ti istio-ingressgateway-9bf5485c6-9zdch -- curl localhost:15000/memory
{
 "allocated": "15088600",
 "heap_size": "23068672",
 "pageheap_unmapped": "0",
 "pageheap_free": "712704",
 "total_thread_cache": "5152088",
 "total_physical_bytes": "25690112"
}
 [14:02][home/ecaoyuk/servicemesh]$ k exec -ti istio-ingressgateway-9bf5485c6-9zdch -- curl localhost:15000/memory
{
 "allocated": "15085912",
 "heap_size": "23068672",
 "pageheap_unmapped": "638976",
 "pageheap_free": "0",
 "total_thread_cache": "4787872",
 "total_physical_bytes": "25051136"
}

Maybe you could explain what the "heap_size" is? My suspicion after reading https://stackoverflow.com/questions/43065076/does-os-have-to-deallocate-processs-heap-memory is: "heap_size" is the memory the OS has allocated to the program (Envoy), and the pair malloc()/free() (and their siblings and equivalents) do not interact with the OS; allocations are made and freed by the program within the block given by the OS. If so, "heap_size" will grow whenever Envoy requires more memory and will only be cleaned up when Envoy exits. IF this is the case, then the logic where Envoy uses "heap_size" to judge whether the application is overloaded is not correct; the "allocated" parameter or some metric indicating in-use heap would be better. Because, going back to the original question, why should the ingressgateway be kept in an overload condition forever after the traffic ends?

  2. I have attached the pdf and pb.gz for the heaps I collected. I also tried both "pprof --alloc_space" and "inuse_space", because inuse_space only returns a few KB of memory usage, which doesn't seem correct. [heapprofiler.zip](https://github.com/envoyproxy/envoy/files/9073223/heapprofiler.zip)
 [16:39][tmp/envoy/lib]$ sudo PPROF_BINARY_PATH=/tmp/envoy/lib/ /root/go/bin/pprof --alloc_space /tmp/envoy/lib/envoy.full /tmp/envoy/envoy.prof.0001.heap
Enter Windows password:
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: envoy.full
Type: alloc_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 379.10MB, 83.66% of 453.13MB total
Dropped 329 nodes (cum <= 2.27MB)
Showing top 10 nodes out of 125
      flat  flat%   sum%        cum   cum%
  146.06MB 32.23% 32.23%   146.06MB 32.23%  _GLOBAL__sub_I_quiche_flags_impl.cc
   71.11MB 15.69% 47.93%    71.11MB 15.69%  Envoy::Http::HeaderMapImpl::clear
   42.81MB  9.45% 57.38%    42.81MB  9.45%  Envoy::Buffer::OwnedImpl::linearize
   29.92MB  6.60% 63.98%    30.73MB  6.78%  absl::container_internal::raw_hash_set::extract
   23.95MB  5.29% 69.26%   250.51MB 55.29%  <unknown>
   16.82MB  3.71% 72.98%    47.73MB 10.53%  Envoy::Router::Filter::decodeHeaders
   16.18MB  3.57% 76.55%    16.18MB  3.57%  Envoy::Http::HeaderMapImpl::setCopy
   13.77MB  3.04% 79.59%    13.77MB  3.04%  Envoy::Http::FilterManager::encodeMetadata
   11.07MB  2.44% 82.03%    11.07MB  2.44%  Envoy::Extensions::Common::Aws::SignatureConstantValues::SignatureConstantValues
    7.38MB  1.63% 83.66%   427.71MB 94.39%  spdlog::logger::log_
 [16:40][tmp/envoy/lib]$ sudo PPROF_BINARY_PATH=/tmp/envoy/lib/ /root/go/bin/pprof /tmp/envoy/lib/envoy.full /tmp/envoy/envoy.prof.0001.heap
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: envoy.full
Type: inuse_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2427.34kB, 79.78% of 3042.50kB total
Dropped 197 nodes (cum <= 15.21kB)
Showing top 10 nodes out of 174
      flat  flat%   sum%        cum   cum%
  904.82kB 29.74% 29.74%   904.82kB 29.74%  poly_mul_vec_aux
  383.09kB 12.59% 42.33%   383.09kB 12.59%  OPENSSL_lh_insert
  337.73kB 11.10% 53.43%   339.42kB 11.16%  absl::container_internal::raw_hash_set::extract
     288kB  9.47% 62.90%      288kB  9.47%  Envoy::Buffer::OwnedImpl::linearize
  148.80kB  4.89% 67.79%   254.12kB  8.35%  Envoy::Http::NoConnectCodecClientProd::NoConnectCodecClientProd
  115.78kB  3.81% 71.59%  2609.54kB 85.77%  spdlog::logger::log_
  103.44kB  3.40% 74.99%   103.44kB  3.40%  Envoy::Http::HeaderMapImpl::clear
   49.08kB  1.61% 76.61%  1100.01kB 36.15%  <unknown>
   48.48kB  1.59% 78.20%    57.68kB  1.90%  non-virtual thunk to Envoy::Extensions::TransportSockets::Tls::ServerContextConfigImpl::~ServerContextConfigImpl()
   48.12kB  1.58% 79.78%    48.12kB  1.58%  _GLOBAL__sub_I_quiche_flags_impl.cc
 [16:44][/tmp/envoy]$ sudo PPROF_BINARY_PATH=/tmp/envoy/lib/ /root/go/bin/pprof --alloc_space /tmp/envoy/lib/envoy.full /tmp/envoy/envoy.prof.0008.heap
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: envoy.full
Type: alloc_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 137.35MB, 98.97% of 138.77MB total
Dropped 255 nodes (cum <= 0.69MB)
Showing top 10 nodes out of 88
      flat  flat%   sum%        cum   cum%
   88.23MB 63.58% 63.58%    88.23MB 63.58%  Envoy::Buffer::OwnedImpl::linearize
   26.41MB 19.03% 82.61%    26.41MB 19.03%  nghttp2_http_on_header
   16.99MB 12.24% 94.85%    16.99MB 12.24%  nghttp2_hd_inflate_new2
    3.53MB  2.54% 97.40%     3.53MB  2.54%  poly_mul_vec_aux
    0.98MB  0.71% 98.10%     0.98MB  0.71%  OPENSSL_lh_insert
    0.72MB  0.52% 98.62%     0.72MB  0.52%  absl::container_internal::raw_hash_set::extract
    0.31MB  0.22% 98.84%   137.69MB 99.22%  spdlog::logger::log_
    0.08MB 0.055% 98.90%     0.86MB  0.62%  std::__1::__function::__func::operator()
    0.07MB 0.048% 98.95%     1.28MB  0.92%  Envoy::Event::DispatcherImpl::~DispatcherImpl
    0.03MB 0.025% 98.97%     2.77MB  1.99%  Envoy::Http::FixedHttpConnPoolImpl::FixedHttpConnPoolImpl
(pprof) exit
 [16:44][/tmp/envoy]$ sudo PPROF_BINARY_PATH=/tmp/envoy/lib/ /root/go/bin/pprof /tmp/envoy/lib/envoy.full /tmp/envoy/envoy.prof.0008.heap
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: envoy.full
Type: inuse_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5462.65kB, 88.79% of 6152.22kB total
Dropped 201 nodes (cum <= 30.76kB)
Showing top 10 nodes out of 142
      flat  flat%   sum%        cum   cum%
 2314.77kB 37.62% 37.62%  2314.77kB 37.62%  poly_mul_vec_aux
 1003.32kB 16.31% 53.93%  1003.32kB 16.31%  OPENSSL_lh_insert
  734.77kB 11.94% 65.88%   734.77kB 11.94%  absl::container_internal::raw_hash_set::extract
  394.94kB  6.42% 72.30%   674.50kB 10.96%  Envoy::Http::NoConnectCodecClientProd::NoConnectCodecClientProd
     352kB  5.72% 78.02%      352kB  5.72%  Envoy::Buffer::OwnedImpl::linearize
  307.39kB  5.00% 83.01%  5142.33kB 83.58%  spdlog::logger::log_
  128.69kB  2.09% 85.11%   153.09kB  2.49%  non-virtual thunk to Envoy::Extensions::TransportSockets::Tls::ServerContextConfigImpl::~ServerContextConfigImpl()
   79.88kB  1.30% 86.40%    79.88kB  1.30%  virtual thunk to Envoy::Network::ServerConnectionImpl::setTransportSocketConnectTimeout(std::__1::chrono::duration, Envoy::Stats::Counter&)
   75.91kB  1.23% 87.64%   883.53kB 14.36%  std::__1::__function::__func::operator()
      71kB  1.15% 88.79%       71kB  1.15%  absl::container_internal::raw_hash_set::prepare_insert

caoyukun0430 avatar Jul 08 '22 15:07 caoyukun0430

During the traffic we saw "pageheap_free" drop from some bytes to zero; why does the heap_size remain the same? [What is heap size?]

Heap size is from https://github.com/envoyproxy/envoy/blob/2f8b98691bec3df6dd47c88ef6524262e88dd4a7/source/common/memory/stats.cc#L67 in Envoy, which is defined here in gperftools tcmalloc: https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L164 Reiterated here: number of bytes in the heap == current_allocated_bytes + fragmentation + freed memory regions
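As a back-of-the-envelope illustration of that identity, using the "after one day" /memory sample posted earlier in this issue (just arithmetic on the reported numbers, not an exact accounting of where each byte sits):

// Values in bytes, taken from the /memory output above.
constexpr long long heap_size = 22020096; // generic.heap_size
constexpr long long allocated = 8407824;  // current_allocated_bytes
// Per the identity above, whatever is not "allocated" must be fragmentation
// plus freed memory regions (thread/central/transfer caches, free spans, ...):
constexpr long long frag_and_freed = heap_size - allocated; // 13612272, roughly 13 MiB
static_assert(frag_and_freed > 0, "heap_size stays well above what the app is using");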

Then the logic where Envoy uses "heap_size" to judge whether the application is overloaded is not correct; the "allocated" parameter or some metric indicating in-use heap would be better

The logic for the fixed heap calculation of resource usage is here: https://github.com/envoyproxy/envoy/blob/2f8b98691bec3df6dd47c88ef6524262e88dd4a7/source/extensions/resource_monitors/fixed_heap/fixed_heap_monitor.cc#L25. We take generic.heap_size as physical and tcmalloc.pageheap_unmapped_bytes as unmapped and subtract the two: physical - unmapped => current_allocated_bytes + fragmentation + freed memory regions - tcmalloc.pageheap_unmapped_bytes

If we just used current_allocated_bytes we'd miss memory that's counting against the process above.
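As a simplified sketch of that computation (not the actual Envoy source, just the same gperftools properties named above):

#include <cstddef>
#include <gperftools/malloc_extension.h>

// Sketch of the fixed-heap pressure described above:
// pressure = (generic.heap_size - tcmalloc.pageheap_unmapped_bytes) / max_heap_size_bytes
double fixedHeapPressure(std::size_t max_heap_size_bytes) {
  std::size_t physical = 0; // allocated + fragmentation + freed regions
  std::size_t unmapped = 0; // pages already released back to the OS
  MallocExtension::instance()->GetNumericProperty("generic.heap_size", &physical);
  MallocExtension::instance()->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
  // e.g. ~21MB used against the 30MB limit gives ~0.7, right at the
  // stop_accepting_connections threshold configured in this issue.
  return static_cast<double>(physical - unmapped) / static_cast<double>(max_heap_size_bytes);
}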

From your in space profile it seems like you still have connections pools allocated. Nothing pops out too much.

I think fragmentation and caching inside the malloc implementation could be behind some of the differences between the in-use space pprof and the stats. Note the total thread cache is ~5MB in both your earlier examples, and from https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L183 it can grow to 16MB by default. Perhaps measuring fragmentation and switching to using smaller page sizes to reduce fragmentation could also help.

I'm not too familiar with tuning Envoy for small O(10)MB use cases; perhaps @kyessenov or @alyssawilk might be more familiar.

KBaichoo avatar Jul 11 '22 21:07 KBaichoo

Thanks for the explanation! Still, I am afraid my questions remain:

  1. Overall this is not some small O(10)MB use case; I was testing with O(10MB) because it's easy to reach, and I think it's comparable to O(100MB) use cases. I tested today with 1000 connections so that "heap_size" reaches >100 MB. We thought maybe our former tests used too small a heap for the OS to trigger any free heap collection, but the results here show that even after quite a few hours heap_size stays high while current_allocated_bytes drops.
[15:12]$  kubectl exec -ti pod/istio-ingressgateway-596fc5548-l78vv -n istio-system curl localhost:15000/memory
{
 "allocated": "59898624",
 "heap_size": "114294784",
 "pageheap_unmapped": "1007616",
 "pageheap_free": "0",
 "total_thread_cache": "23402752",
 "total_physical_bytes": "118530048"
}
[22:11]$  kubectl exec -ti pod/istio-ingressgateway-596fc5548-l78vv -n istio-system curl localhost:15000/memory
{
 "allocated": "8812560",
 "heap_size": "114294784",
 "pageheap_unmapped": "11173888",
 "pageheap_free": "0",
 "total_thread_cache": "25954912",
 "total_physical_bytes": "108363776"
}
  2. As you explained, heap_size == current_allocated_bytes + fragmentation + freed memory regions, but I am not sure what fragmentation contains (I just assume it's part of the allocated/used bytes). Since we can see heap_size stay the same while allocated drops, we can say the freed memory regions increase accordingly. The overload manager then considers used == physical - unmapped == current_allocated_bytes + fragmentation + freed memory regions - unmapped. Since it's free memory, why would we include it in the used bytes? In my view "used==current_allocated_bytes + fragmentation - unmapped" is more reasonable. Also, if the logic is changed to this, then as long as allocated drops (meaning free memory increases) we will see used decrease. Only after excluding free memory from heap_size (or used) will the ingressgw pod be freed from the overload status, right?
  3. You stated "From your in space profile it seems like you still have connections pools allocated." Which data indicates that?
  4. "Perhaps measuring fragmentation and switching to using smaller page sizes to reduce fragmentation could also help." From the latest test, we have a thread cache of ~25MB. Where can we see how much fragmentation there is, and how can we shrink it?

caoyukun0430 avatar Jul 12 '22 20:07 caoyukun0430

I think it's comparable to O(100Mb) use cases. I tested today with 1000 connections so that "heap_size" reaches >100 MB.

Would be useful to get the space usage for those cases then, since that is the real use case. I think with the 30MB case, the memory can easily be consumed, e.g. by fragmentation.

I am not sure what contains in fragmentation

Not an expert on gperftools tcmalloc, but usually it uses a page for various "object sizes" unless we're talking about large >1MB allocations. As such, fragmentation can occur if, say, an 8K page is split into 128B objects and we have 8 of those objects allocated: the remaining ~7K is "fragmented", i.e. we can't free the page due to the objects allocated on it, but the application isn't using that space for anything useful. See https://gperftools.github.io/gperftools/tcmalloc.html for more.
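To make the arithmetic in that example concrete (hypothetical numbers matching the example above, nothing measured):

#include <cstddef>

// Hypothetical size-class example: an 8K tcmalloc page split into 128-byte
// objects, with only 8 of those objects still allocated.
constexpr std::size_t kPageSize = 8 * 1024;                      // 8192 bytes
constexpr std::size_t kObjectSize = 128;                         // size class
constexpr std::size_t kLiveObjects = 8;
constexpr std::size_t kInUseBytes = kLiveObjects * kObjectSize;  // 1024 bytes actually used
constexpr std::size_t kStrandedBytes = kPageSize - kInUseBytes;  // 7168 bytes, the ~7K "fragmented"
// The page cannot be returned to the OS while those 8 objects are live, so the
// ~7K shows up in heap_size but not in "allocated" and not in pageheap_free.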

In my view "used==current_allocated_bytes + fragmentation - unmapped" is more reasonable

Yes I agree, we should subtract Stats::totalPageHeapFree in the fixed heap monitor's computation of resource usage. This shouldn't count against the system.
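In code, the adjustment would look roughly like this (a sketch of the idea only, not the exact patch):

#include <cstddef>
#include <gperftools/malloc_extension.h>

// Sketch: additionally discount mapped-but-free pages (tcmalloc.pageheap_free_bytes,
// i.e. Stats::totalPageHeapFree) from the fixed-heap usage, on top of the unmapped bytes.
std::size_t usedHeapBytes() {
  std::size_t physical = 0, unmapped = 0, free_mapped = 0;
  MallocExtension::instance()->GetNumericProperty("generic.heap_size", &physical);
  MallocExtension::instance()->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
  MallocExtension::instance()->GetNumericProperty("tcmalloc.pageheap_free_bytes", &free_mapped);
  return physical - unmapped - free_mapped;
}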

"From your in space profile it seems like you still have connections pools allocated."

Look for FixedHttpConnPoolImpl / Client objects that are used in the pools.

"Perhaps measuring fragmentation and switching to using smaller page sizes to reduce fragmentation could also help." From the latest test, we have a thread cache of ~25MB. Where can we see how much fragmentation there is, and how can we shrink it?

This is really more of a gperftools tcmalloc question. I think decreasing page sizes could decrease fragmentation, as discussed below (worse for your TLB; run your profiles to see if the tradeoff makes sense for you). See https://gperftools.github.io/gperftools/tcmalloc.html and its repo for more info.

KBaichoo avatar Jul 12 '22 21:07 KBaichoo

In my view "used==current_allocated_bytes + fragmentation - unmapped" is more reasonable

Yes I agree, we should subtract Stats::totalPageHeapFree in the fixed heap monitor's computation of resource usage. This shouldn't count against the system.

Thanks for the reply, and sorry for the delayed response. Maybe I'll just focus on this point here, since it's the most promising direction. So our agreement is to remove "free memory regions" from the used heap calculation, i.e. used == physical - unmapped == current_allocated_bytes + fragmentation - unmapped. Then my follow-up question is: what do the free memory regions consist of? From your indication, it seems only totalPageHeapFree, i.e. pageheap_free_bytes, counts as a free memory region. But from my experiments I thought "free memory regions" would contain more. From our previous discussion, heap == current_allocated_bytes + fragmentation + freed memory regions, and the experiment below shows that allocated drops dramatically while pageheap_free is always zero! Do you think all of that heap is transferred to fragmentation then? In my view the free memory regions contain more than pageheap_free, probably also pageheap_unmapped or other "free" bytes not listed here; otherwise it's hard to explain.

P.S. Do you think this is a reasonable feature/improvement for Envoy to include? Should we raise a PR for it?

[15:11]$  kubectl exec -ti pod/istio-ingressgateway-596fc5548-l78vv -n istio-system curl localhost:15000/memory
{
 "allocated": "64747984",
 "heap_size": "113246208",
 "pageheap_unmapped": "417792",
 "pageheap_free": "0",
 "total_thread_cache": "25709128",
 "total_physical_bytes": "118071296"
}
[22:11]$  kubectl exec -ti pod/istio-ingressgateway-596fc5548-l78vv -n istio-system curl localhost:15000/memory
{
 "allocated": "8812560",
 "heap_size": "114294784",
 "pageheap_unmapped": "11173888",
 "pageheap_free": "0",
 "total_thread_cache": "25954912",
 "total_physical_bytes": "108363776"
}

caoyukun0430 avatar Jul 18 '22 08:07 caoyukun0430

@KBaichoo any further idea or suggestions? thanks

caoyukun0430 avatar Aug 09 '22 13:08 caoyukun0430

From https://github.com/envoyproxy/envoy/pull/22585 we now discount free mapped pages from the utilization calculation.

Then my follow-up question is: what do the free memory regions consist of?

From https://gperftools.github.io/gperftools/tcmalloc.html I think this refers to entire pages that are free and can be used for allocation. I think it doesn't include fragmentation.

Do you think all of that heap is transferred to fragmentation then?

I'm not a TCMalloc expert, but I think it could be due to fragmentation. See https://github.com/gperftools/gperftools/issues/756#issuecomment-175179193, which seems relevant. Perhaps building with -DTCMALLOC_SMALL_BUT_SLOW would help here, or setting tcmalloc.max_total_thread_cache_bytes smaller, as it defaults to 16MB: https://github.com/gperftools/gperftools/blob/f7c6fb6c8e99d6b1b725e5994373bcd19ffdf8fd/src/gperftools/malloc_extension.h#L181

More broadly, to keep your deployment from freezing up, you could consider some additional overload manager actions that shed memory: https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/overload_manager/overload_manager#config-overload-manager

With the actions you have configured, the shrink heap action isn't able to do anything, as the memory it can easily free up (given how tcmalloc is configured) is already freed. Your configuration does nothing to shed existing connections, leaving the system to just stop accepting new connections. You might also want to raise the threshold at which you trigger stop accepting connections, based on your expected workload.

KBaichoo avatar Aug 09 '22 15:08 KBaichoo

Hi @KBaichoo, thanks for the PR. I am afraid it won't solve my issue, though; of course I will do more tests to verify.

  1. My question above was asking what the "free memory region" consists of. From my experiments, "allocated" drops from 64747984 to 8812560, but "pageheap_free" stays zero, which means subtracting totalPageHeapFree (pageheap_free_bytes) in the PR doesn't help release the overload manager from stop accepting connections. Some other part must be holding the freed memory.
  2. I already tried other options like heapShrinkPercent and stop accepting requests, but this behavior is independent of the actions; the problem is that heap_size never drops, so we can't be freed from the overload state. And even if we set a larger threshold, it can also be reached.

So in my opinion, more evaluation should be performed on how to improve the "fixed_heap_size" measurement and on what the free memory consists of.

caoyukun0430 avatar Aug 09 '22 15:08 caoyukun0430

Hi @KBaichoo, my test results are as I expected: the issue still persists.

  1. "More broadly, to keep your deployment from freezing up, you could consider some additional overload manager actions that shed memory": as you can see from my config, I also set heapShrinkPercent to a lower value so it triggers before stop accepting requests/connections, to alleviate the issue, but it doesn't have a visible effect. Other actions like limiting active connections or resetting streams don't help either.
  2. "Your configuration does nothing to shed existing connections, leaving the system to just stop accepting new connections." The purpose of the test is to trigger the stop accepting connections/requests actions and see how the overload manager reacts, and setting a higher threshold doesn't make much sense, as there is always a way to reach it. And the main issue remains: memory is not properly freed up even with the PR, so the ingressgw/Envoy is stuck in the overload situation. I would suggest we take a deep look into what the free memory region consists of again. Thanks!

caoyukun0430 avatar Aug 15 '22 14:08 caoyukun0430

So in my opinion, more evaluation should be performed on how to improve the "fixed_heap_size" measurement and on what the free memory consists of.

@caoyukun0430 could you add stats for all of the *_free_bytes properties in https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/gperftools/malloc_extension.h#L189 to https://github.com/envoyproxy/envoy/blob/6b83db42c418abca6d21160cf6b1c439657d6ef0/source/common/memory/stats.h#L11? This might help us track down where the free bytes are in your application. Adding an Envoy admin endpoint that queries https://github.com/envoyproxy/envoy/blob/6b83db42c418abca6d21160cf6b1c439657d6ef0/source/common/memory/stats.cc#L96 would also help. Thanks!
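For reference, a minimal sketch of how one of those properties could be read (this mirrors the gperftools-based accessors in source/common/memory/stats.cc but is not the exact Envoy code; the helper name is made up):

#include <cstddef>
#include <gperftools/malloc_extension.h>

// Hypothetical helper in the spirit of Envoy::Memory::Stats: read one of the
// gperftools *_free_bytes properties so it can be exported as a gauge or log line.
std::size_t centralCacheFreeBytes() {
  std::size_t value = 0;
  // GetNumericProperty returns false if the property is unknown; 0 is a safe fallback.
  MallocExtension::instance()->GetNumericProperty("tcmalloc.central_cache_free_bytes", &value);
  return value;
}
// The same pattern applies to "tcmalloc.transfer_cache_free_bytes" and
// "tcmalloc.thread_cache_free_bytes".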

KBaichoo avatar Aug 15 '22 16:08 KBaichoo

Hi @KBaichoo, thanks for the advice, I have tested it. I added three *_free_bytes parameters (tcmalloc.central_cache_free_bytes, tcmalloc.transfer_cache_free_bytes, tcmalloc.thread_cache_free_bytes) to Stats so that I can see them via the /stats interface. I am not sure how to add a new admin endpoint, so I raised the logging level of dumpStatsToLog() and called it from the FixedHeapMonitor::updateResourceUsage() function so that I can see the output in the pod logs.

k logs istio-ingressgateway-6997b477cb-kgsdw --tail=20 -f | grep -i tcma
2022-08-18T12:07:49.603861Z	warning	envoy misc	TCMalloc stats:
MALLOC:           8192              Tcmalloc page size

The useful results below are: stop_accepting_requests triggered at 14:11 (the threshold I set is 50MB); thread_cache_free_bytes remains high at ~12MB before and after; but comparing the results at 14:11 and 20:35 we see that memory released from allocated goes to central_cache_free_bytes and transfer_cache_free_bytes. Allocated 33.8MB -> 8.6MB, central_cache_free_bytes 1.8MB -> 17.5MB, transfer_cache_free_bytes 1.3MB -> 10MB.

So: 1. Do you know more about thread_cache_free_bytes and why it remains high? 2. In the PR above, we only consider pageheap_free as free memory and discount it from used. From the results, do you agree that (part of) central_cache_free_bytes and transfer_cache_free_bytes should also be considered free memory and discounted from used memory in the context of the fixed heap monitor?

[13:57]
server.memory_allocated: 31311280
server.memory_central_cache_free_bytes: 1706096
server.memory_heap_size: 46137344
server.memory_physical_size: 48365568
server.memory_thread_cache_free_bytes: 12146656
server.memory_transfer_cache_free_bytes: 432640
[14:11]
server.memory_allocated: 33847104
server.memory_central_cache_free_bytes: 1798472
server.memory_heap_size: 50331648
server.memory_physical_size: 52527104
server.memory_thread_cache_free_bytes: 12804984
server.memory_transfer_cache_free_bytes: 1324032
[20:35]
server.memory_allocated: 8655664
server.memory_central_cache_free_bytes: 17527888
server.memory_heap_size: 51380224
server.memory_physical_size: 50225152
server.memory_thread_cache_free_bytes: 11202688
server.memory_transfer_cache_free_bytes: 10086400

caoyukun0430 avatar Aug 18 '22 18:08 caoyukun0430

Neat, can I get the output of dumpStatsToLog()? It should look like the example below and be even more detailed: https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/tcmalloc.cc#L404

Do you know more about thread_cache_free_bytes and why it remains high?

TCMalloc tries to have a local per thread freelist from which it can quickly make allocations without having to grab a lock.

we only consider pageheap_free as free memory and discount it from used. From the results, do you agree that (part of) central_cache_free_bytes and transfer_cache_free_bytes should also be considered free memory and discounted from used memory in the context of the fixed heap monitor?

The reason we can discount pageheap_free is that it is empty pages that TCMalloc can easily return to the OS. All the other free memory cannot be easily returned to the OS because pieces of those chunks of memory are in use. As such, discounting them is risky, since we could easily go over the heap size measured externally and OOM if there's a resource limit on the job.

A large central freelist implies we have a long tail of memory spans (~a set of pages) sitting idle. If we had the data from dumpStatsToLog above, we might be able to spot particular fragmentation in a size class.

You might get some mileage from setting the env var TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a lower value to reduce how big the thread caches get. It's a tradeoff of CPU for memory, but might be fine for a smaller deployment.

KBaichoo avatar Aug 19 '22 16:08 KBaichoo

Hi @KBaichoo, thanks for the explanation! Could you please hint at how I can get the output that DumpStats() shows in tcmalloc.cc?

What I did was increase the logging level of ENVOY_LOG_MISC to warn and then call dumpStatsToLog() in the two functions below so that I can see the logs via kubectl logs:

void Stats::dumpStatsToLog() {
 // ...
  ENVOY_LOG_MISC(warn, "TCMalloc stats:\n{}", buffer.get());
}

envoy/source/common/memory/utils.cc
@ void Utils::tryShrinkHeap() {
 #if defined(TCMALLOC) || defined(GPERFTOOLS_TCMALLOC)
   auto total_physical_bytes = Stats::totalPhysicalBytes();
   auto allocated_size_by_app = Stats::totalCurrentlyAllocated();
+  Stats::dumpStatsToLog();
   ASSERT(total_physical_bytes >= allocated_size_by_app);
envoy/source/extensions/resource_monitors/fixed_heap/fixed_heap_monitor.cc
@ void FixedHeapMonitor::updateResourceUsage(Server::ResourceUpdateCallbacks& call
       const size_t physical = stats_->reservedHeapBytes();
       const size_t unmapped = stats_->unmappedHeapBytes();
       const size_t free_mapped = stats_->freeMappedHeapBytes();
+      Memory::Stats::dumpStatsToLog();
       ASSERT(physical >= (unmapped + free_mapped));
       return physical - unmapped - free_mapped;
     } else {

But the output above only shows one entry:

2022-08-18T12:07:49.603861Z	warning	envoy misc	TCMalloc stats:
MALLOC:           8192              Tcmalloc page size

What I observed is that the effective print in dumpStatsToLog() is buffer.get(), which returns a char[]. Does this match the output produced inside DumpStats() in tcmalloc.cc https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/tcmalloc.cc#L404? Do you have any idea what is wrong with my prints, or is there a better way to output DumpStats(), i.e. how should I call static void DumpStats(TCMalloc_Printer* out, int level) in tcmalloc.cc?

Thanks a lot!

caoyukun0430 avatar Aug 23 '22 12:08 caoyukun0430

Hey @caoyukun0430,

I ran

[01:53:04] kbaichoo:envoy:0 $bazel test --define tcmalloc=gperftools  //test/common/memory:heap_shrinker_test --test_output=all --test_arg="-l debug"
INFO: Build option --define has changed, discarding analysis cache.
INFO: Analyzed target //test/common/memory:heap_shrinker_test (1 packages loaded, 18943 targets configured).
INFO: Found 1 test target...
INFO: From Testing //test/common/memory:heap_shrinker_test:
==================== Test output for //test/common/memory:heap_shrinker_test:
WARNING: Perftools heap leak checker is active -- Performance may suffer
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HeapShrinkerTest
[ RUN      ] HeapShrinkerTest.DoNotShrinkWhenNotConfigured
[2022-08-23 14:00:19.134][12][info][misc] [test/test_common/utility.cc:63] TestRandomGenerator running with seed -1763186326
[2022-08-23 14:00:19.134][12][info][misc] [test/test_common/utility.cc:63] TestRandomGenerator running with seed -1763186326
[2022-08-23 14:00:19.137][12][debug][main] [source/common/event/dispatcher_impl.cc:90] destroying dispatcher test_thread
[       OK ] HeapShrinkerTest.DoNotShrinkWhenNotConfigured (4 ms)
[ RUN      ] HeapShrinkerTest.ShrinkWhenTriggered
[2022-08-23 14:00:19.138][12][info][misc] [test/test_common/utility.cc:63] TestRandomGenerator running with seed -1763186326
[2022-08-23 14:00:19.138][12][info][misc] [test/test_common/utility.cc:63] TestRandomGenerator running with seed -1763186326
[2022-08-23 14:00:19.139][12][warning][misc] [test/test_common/simulated_time_system.cc:336] Simulated timer enabled. Use advanceTimeWait or advanceTimeAsync functions to ensure it is called.
[2022-08-23 14:00:19.143][12][debug][misc] [source/common/memory/stats.cc:100] GPERFTOOLS TCMalloc stats:
------------------------------------------------
MALLOC:        1361592 (    1.3 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +       375392 (    0.4 MiB) Bytes in central cache freelist
MALLOC: +            0 (    0.0 MiB) Bytes in transfer cache freelist
MALLOC: +       245480 (    0.2 MiB) Bytes in thread cache freelists
MALLOC: +      2621440 (    2.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =      4603904 (    4.4 MiB) Actual memory used (physical + swap)
MALLOC: +      5120000 (    4.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =      9723904 (    9.3 MiB) Virtual address space used
MALLOC:
MALLOC:            171              Spans in use
MALLOC:              2              Thread heaps in use
MALLOC:           8192              Tcmalloc page size


which invokes Stats::dumpStatsToLog() and it seems to output the expected result. Perhaps something earlier in the pipeline is broken? Do you get the same output if you run the same command above?

KBaichoo avatar Aug 23 '22 14:08 KBaichoo

Hi @KBaichoo, thanks for the test! It was my fault: the grep filtered out the TCMalloc logs XD. Here are the results; please see my comments below.

Neat, can I get the output of the dumpStatsToLog(), it should look below and be even more detailed. https://github.com/gperftools/gperftools/blob/bf8b714bf5075d0a6f2f28504b43095e2b1e11c5/src/tcmalloc.cc#L404

Yukun: I did the same test and here are the results. Basically /stats and dumpStatsToLog() show the same parameters, and we still see the memory released from allocated go to central_cache_free_bytes and transfer_cache_free_bytes.

######## when stop_accept_request triggers
[11:33]$ kubectl exec -ti pod/istio-ingressgateway-6997b477cb-z9mhv -n istio-system -- curl localhost:15000/memory
{
 "allocated": "32626144",
 "heap_size": "56623104",
 "pageheap_unmapped": "393216",
 "pageheap_free": "0",
 "total_thread_cache": "15727104",
 "total_physical_bytes": "59113472"
}
[11:33]$ kubectl exec -ti pod/istio-ingressgateway-6997b477cb-z9mhv -n istio-system -- curl localhost:15000/stats | grep -i memory
server.memory_allocated: 32610608
server.memory_central_cache_free_bytes: 3543072
server.memory_heap_size: 56623104
server.memory_physical_size: 59121664
server.memory_thread_cache_free_bytes: 15743152
server.memory_transfer_cache_free_bytes: 4341248
$ k logs istio-ingressgateway-6997b477cb-z9mhv -f | grep -i MALLOC
2022-08-24T11:33:23.178762Z	warning	envoy misc	TCMalloc stats:
MALLOC:       32758576 (   31.2 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +      3452192 (    3.3 MiB) Bytes in central cache freelist
MALLOC: +      4319232 (    4.1 MiB) Bytes in transfer cache freelist
MALLOC: +     15781808 (   15.1 MiB) Bytes in thread cache freelists
MALLOC: +      2883584 (    2.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     59195392 (   56.5 MiB) Actual memory used (physical + swap)
MALLOC: +       311296 (    0.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =     59506688 (   56.8 MiB) Virtual address space used
MALLOC:
MALLOC:           5817              Spans in use
MALLOC:              9              Thread heaps in use
MALLOC:           8192              Tcmalloc page size

####### one hour later, allocated memory is released as I described before
[12:37]$ kubectl exec -ti pod/istio-ingressgateway-6997b477cb-z9mhv -n istio-system -- curl localhost:15000/memory
{
 "allocated": "8672096",
 "heap_size": "57671680",
 "pageheap_unmapped": "3465216",
 "pageheap_free": "0",
 "total_thread_cache": "18984904",
 "total_physical_bytes": "57090048"
}
[12:37]$ kubectl exec -ti pod/istio-ingressgateway-6997b477cb-z9mhv -n istio-system -- curl localhost:15000/stats | grep -i memory
server.memory_allocated: 8656560
server.memory_central_cache_free_bytes: 16314584
server.memory_heap_size: 57671680
server.memory_physical_size: 57073664
server.memory_thread_cache_free_bytes: 18988152
server.memory_transfer_cache_free_bytes: 10230784
$ k logs istio-ingressgateway-6997b477cb-z9mhv -f | grep -i MALLOC
2022-08-24T12:34:48.977299Z	warning	envoy misc	TCMalloc stats:
MALLOC:        8781600 (    8.4 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +     16307416 (   15.6 MiB) Bytes in central cache freelist
MALLOC: +     10170368 (    9.7 MiB) Bytes in transfer cache freelist
MALLOC: +     18947080 (   18.1 MiB) Bytes in thread cache freelists
MALLOC: +      2883584 (    2.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     57090048 (   54.4 MiB) Actual memory used (physical + swap)
MALLOC: +      3465216 (    3.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =     60555264 (   57.8 MiB) Virtual address space used
MALLOC:
MALLOC:           5839              Spans in use
MALLOC:              9              Thread heaps in use
MALLOC:           8192              Tcmalloc page size

A large central freelist implies we have a longtail of memory spans (~a set of pages) sitting idle. If we had the data from dumpStatsToLog above, we might be able to spot particular fragmentation in a size class.

Yukun: I didn't spot much more info from dumpStatsToLog(), as the results are similar to /stats, so I think my conclusion still applies: (part of) central_cache_free_bytes and transfer_cache_free_bytes should also be considered free memory and discounted from used memory in the context of the fixed heap monitor. You said "All the other free memory cannot be easily returned to the OS because pieces of those chunks of memory are in use", so you think the memory released from allocated into central_cache_free_bytes and transfer_cache_free_bytes will stay there and not be returned to the OS because it's in use, and therefore it should not be counted as a "free memory region" by the heap monitor? Then I would ask: is it only memory that would be returned to the OS that gets discounted by the heap monitor, while memory sitting in the caches is not? (I thought the memory in the caches would be used by the next round of traffic first, and the OS would only allocate more memory if the caches ran out, so to me it should be considered a free memory region.) What is your opinion?

You might get some mileage from setting the env var TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to be lower to reduce how big the thread caches get. It's a tradeoff of cpu for memory, but might be fine for a smaller deployment.

Yukun: memory_thread_cache_free_bytes honestly won't be a big issue; as we can see, it doesn't change much during and after traffic (15~18MB). It's more an issue between allocated and central_cache_free_bytes and transfer_cache_free_bytes.

caoyukun0430 avatar Aug 24 '22 11:08 caoyukun0430

Hey @caoyukun0430

There should be more info. Sorry for not being clear about the output of gperftools TCMalloc; it should also include information about size classes, e.g. like this:

        MALLOC:           8192              Tcmalloc page size
        ------------------------------------------------
        Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
        Bytes released to the OS take up virtual address space but no physical memory.
        ------------------------------------------------
        Total size of freelists for per-thread caches,
        transfer cache, and central cache, by size class
        ------------------------------------------------
        class   1 [        8 bytes ] :      863 objs;   0.0 MiB;   0.0 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
        class   2 [       16 bytes ] :      425 objs;   0.0 MiB;   0.0 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
        class   3 [       32 bytes ] :      899 objs;   0.0 MiB;   0.0 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
        class   4 [       48 bytes ] :      221 objs;   0.0 MiB;   0.1 cum MiB;    0.001 overhead MiB;    0.001 cum overhead MiB
        class   5 [       64 bytes ] :      172 objs;   0.0 MiB;   0.1 cum MiB;    0.000 overhead MiB;    0.001 cum overhead MiB
        class   6 [       80 bytes ] :      565 objs;   0.0 MiB;   0.1 cum MiB;    0.000 overhead MiB;    0.001 cum overhead MiB
        class   7 [       96 bytes ] :      341 objs;   0.0 MiB;   0.1 cum MiB;    0.000 overhead MiB;    0.001 cum overhead MiB
        class   8 [      112 bytes ] :        7 objs;   0.0 MiB;   0.1 cum MiB;    0.000 overhead MiB;    0.001 cum overhead MiB
        class   9 [      128 bytes ] :       42 objs;   0.0 MiB;   0.1 cum MiB;    0.000 overhead MiB;    0.001 cum overhead MiB
        class  10 [      144 bytes ] :       20 objs;   0.0 MiB;   0.1 cum MiB;    0.000 overhead MiB;    0.002 cum overhead MiB
        class  11 [      160 bytes ] :       46 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.002 cum overhead MiB
        class  12 [      176 bytes ] :       38 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.002 cum overhead MiB
        class  13 [      192 bytes ] :       68 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.002 cum overhead MiB
        class  15 [      224 bytes ] :        7 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  16 [      240 bytes ] :       58 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  17 [      256 bytes ] :       25 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  18 [      288 bytes ] :       20 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  19 [      320 bytes ] :       25 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  20 [      352 bytes ] :       21 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  21 [      384 bytes ] :       19 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  22 [      416 bytes ] :       17 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.003 cum overhead MiB
        class  23 [      448 bytes ] :       18 objs;   0.0 MiB;   0.2 cum MiB;    0.001 overhead MiB;    0.005 cum overhead MiB
        class  24 [      480 bytes ] :       13 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.005 cum overhead MiB
        class  25 [      512 bytes ] :       13 objs;   0.0 MiB;   0.2 cum MiB;    0.000 overhead MiB;    0.005 cum overhead MiB
        class  26 [      576 bytes ] :        7 objs;   0.0 MiB;   0.2 cum MiB;    0.001 overhead MiB;    0.006 cum overhead MiB
        class  27 [      640 bytes ] :        9 objs;   0.0 MiB;   0.3 cum MiB;    0.006 overhead MiB;    0.013 cum overhead MiB
        class  28 [      704 bytes ] :        8 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.013 cum overhead MiB
        class  29 [      768 bytes ] :       10 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.014 cum overhead MiB
        class  30 [      896 bytes ] :        7 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.014 cum overhead MiB
        class  31 [     1024 bytes ] :        6 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.014 cum overhead MiB
        class  32 [     1152 bytes ] :       11 objs;   0.0 MiB;   0.3 cum MiB;    0.001 overhead MiB;    0.015 cum overhead MiB
        class  34 [     1408 bytes ] :        9 objs;   0.0 MiB;   0.3 cum MiB;    0.001 overhead MiB;    0.015 cum overhead MiB
        class  35 [     1536 bytes ] :       10 objs;   0.0 MiB;   0.3 cum MiB;    0.001 overhead MiB;    0.016 cum overhead MiB
        class  36 [     1792 bytes ] :        7 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.017 cum overhead MiB
        class  37 [     2048 bytes ] :        5 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.017 cum overhead MiB
        class  38 [     2304 bytes ] :        2 objs;   0.0 MiB;   0.3 cum MiB;    0.000 overhead MiB;    0.017 cum overhead MiB
        class  39 [     2560 bytes ] :        5 objs;   0.0 MiB;   0.4 cum MiB;    0.001 overhead MiB;    0.018 cum overhead MiB
        class  40 [     2816 bytes ] :        6 objs;   0.0 MiB;   0.4 cum MiB;    0.002 overhead MiB;    0.020 cum overhead MiB
        class  41 [     3072 bytes ] :        3 objs;   0.0 MiB;   0.4 cum MiB;    0.001 overhead MiB;    0.021 cum overhead MiB
        class  42 [     3328 bytes ] :        6 objs;   0.0 MiB;   0.4 cum MiB;    0.001 overhead MiB;    0.022 cum overhead MiB
        class  43 [     4096 bytes ] :        4 objs;   0.0 MiB;   0.4 cum MiB;    0.000 overhead MiB;    0.022 cum overhead MiB
        class  44 [     4608 bytes ] :        3 objs;   0.0 MiB;   0.4 cum MiB;    0.001 overhead MiB;    0.024 cum overhead MiB
        class  46 [     6144 bytes ] :        3 objs;   0.0 MiB;   0.4 cum MiB;    0.000 overhead MiB;    0.024 cum overhead MiB
        class  47 [     6656 bytes ] :        5 objs;   0.0 MiB;   0.5 cum MiB;    0.001 overhead MiB;    0.025 cum overhead MiB
        class  48 [     8192 bytes ] :        1 objs;   0.0 MiB;   0.5 cum MiB;    0.000 overhead MiB;    0.025 cum overhead MiB
        class  49 [     9216 bytes ] :        2 objs;   0.0 MiB;   0.5 cum MiB;    0.004 overhead MiB;    0.028 cum overhead MiB
        class  50 [    10240 bytes ] :        2 objs;   0.0 MiB;   0.5 cum MiB;    0.002 overhead MiB;    0.030 cum overhead MiB
        class  58 [    40960 bytes ] :        1 objs;   0.0 MiB;   0.6 cum MiB;    0.000 overhead MiB;    0.030 cum overhead MiB
        ------------------------------------------------
        PageHeap: 0 sizes;    0.0 MiB free;    4.9 MiB unmapped
        ------------------------------------------------
        128   large *      1 spans ~    4.9 MiB;    4.9 MiB cum; unmapped:    4.9
                         MiB;    4.9 MiB cum

Knowing this information would be very useful for seeing if there's a particular size class that ends up fragmented. If there is, and we can defragment (perhaps by changing the usage pattern), then the spans could end up without allocations and get returned to the OS as empty pages. If nothing stands out we can dig into alternatives, but I think this is the best approach.

KBaichoo avatar Aug 24 '22 14:08 KBaichoo

Hi @KBaichoo, no worries, here is the data. Not sure how to tell whether a class is fragmented or not ;)

2022-08-24T18:43:22.388818Z	warning	envoy misc	TCMalloc stats:
------------------------------------------------
MALLOC:        8773504 (    8.4 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +     16792536 (   16.0 MiB) Bytes in central cache freelist
MALLOC: +     10132736 (    9.7 MiB) Bytes in transfer cache freelist
MALLOC: +     17770408 (   16.9 MiB) Bytes in thread cache freelists
MALLOC: +      2883584 (    2.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     56352768 (   53.7 MiB) Actual memory used (physical + swap)
MALLOC: +      4202496 (    4.0 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =     60555264 (   57.8 MiB) Virtual address space used
MALLOC:
MALLOC:           5821              Spans in use
MALLOC:              9              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
------------------------------------------------
Total size of freelists for per-thread caches,
transfer cache, and central cache, by size class
------------------------------------------------
class   1 [        8 bytes ] :     1002 objs;   0.0 MiB;   0.0 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
class   2 [       16 bytes ] :    25598 objs;   0.4 MiB;   0.4 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
class   3 [       32 bytes ] :    57746 objs;   1.8 MiB;   2.2 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
class   4 [       48 bytes ] :    35100 objs;   1.6 MiB;   3.8 cum MiB;    0.009 overhead MiB;    0.009 cum overhead MiB
class   5 [       64 bytes ] :    20136 objs;   1.2 MiB;   5.0 cum MiB;    0.000 overhead MiB;    0.009 cum overhead MiB
class   6 [       80 bytes ] :    10694 objs;   0.8 MiB;   5.8 cum MiB;    0.004 overhead MiB;    0.013 cum overhead MiB
class   7 [       96 bytes ] :      261 objs;   0.0 MiB;   5.8 cum MiB;    0.001 overhead MiB;    0.014 cum overhead MiB
class   8 [      112 bytes ] :    12380 objs;   1.3 MiB;   7.2 cum MiB;    0.003 overhead MiB;    0.017 cum overhead MiB
class   9 [      128 bytes ] :    11764 objs;   1.4 MiB;   8.6 cum MiB;    0.000 overhead MiB;    0.017 cum overhead MiB
class  10 [      144 bytes ] :     1235 objs;   0.2 MiB;   8.8 cum MiB;    0.004 overhead MiB;    0.021 cum overhead MiB
class  11 [      160 bytes ] :     1201 objs;   0.2 MiB;   8.9 cum MiB;    0.001 overhead MiB;    0.021 cum overhead MiB
class  12 [      176 bytes ] :     2282 objs;   0.4 MiB;   9.3 cum MiB;    0.005 overhead MiB;    0.026 cum overhead MiB
class  13 [      192 bytes ] :     3113 objs;   0.6 MiB;   9.9 cum MiB;    0.010 overhead MiB;    0.036 cum overhead MiB
class  14 [      208 bytes ] :     1158 objs;   0.2 MiB;  10.1 cum MiB;    0.002 overhead MiB;    0.039 cum overhead MiB
class  15 [      224 bytes ] :     2261 objs;   0.5 MiB;  10.6 cum MiB;    0.008 overhead MiB;    0.047 cum overhead MiB
class  16 [      240 bytes ] :     3347 objs;   0.8 MiB;  11.4 cum MiB;    0.003 overhead MiB;    0.050 cum overhead MiB
class  17 [      256 bytes ] :     1991 objs;   0.5 MiB;  11.9 cum MiB;    0.000 overhead MiB;    0.050 cum overhead MiB
class  18 [      288 bytes ] :     2853 objs;   0.8 MiB;  12.6 cum MiB;    0.013 overhead MiB;    0.063 cum overhead MiB
class  19 [      320 bytes ] :    16365 objs;   5.0 MiB;  17.6 cum MiB;    0.120 overhead MiB;    0.183 cum overhead MiB
class  20 [      352 bytes ] :       36 objs;   0.0 MiB;  17.7 cum MiB;    0.000 overhead MiB;    0.184 cum overhead MiB
class  21 [      384 bytes ] :       40 objs;   0.0 MiB;  17.7 cum MiB;    0.000 overhead MiB;    0.184 cum overhead MiB
class  22 [      416 bytes ] :     5175 objs;   2.1 MiB;  19.7 cum MiB;    0.076 overhead MiB;    0.260 cum overhead MiB
class  23 [      448 bytes ] :     1388 objs;   0.6 MiB;  20.3 cum MiB;    0.012 overhead MiB;    0.272 cum overhead MiB
class  24 [      480 bytes ] :     2208 objs;   1.0 MiB;  21.3 cum MiB;    0.004 overhead MiB;    0.277 cum overhead MiB
class  25 [      512 bytes ] :     1511 objs;   0.7 MiB;  22.1 cum MiB;    0.000 overhead MiB;    0.277 cum overhead MiB
class  26 [      576 bytes ] :       29 objs;   0.0 MiB;  22.1 cum MiB;    0.004 overhead MiB;    0.281 cum overhead MiB
class  27 [      640 bytes ] :     3279 objs;   2.0 MiB;  24.1 cum MiB;    0.149 overhead MiB;    0.430 cum overhead MiB
class  28 [      704 bytes ] :     1356 objs;   0.9 MiB;  25.0 cum MiB;    0.053 overhead MiB;    0.483 cum overhead MiB
class  29 [      768 bytes ] :       19 objs;   0.0 MiB;  25.0 cum MiB;    0.002 overhead MiB;    0.485 cum overhead MiB
class  30 [      896 bytes ] :     6234 objs;   5.3 MiB;  30.3 cum MiB;    0.085 overhead MiB;    0.570 cum overhead MiB
class  31 [     1024 bytes ] :      858 objs;   0.8 MiB;  31.2 cum MiB;    0.000 overhead MiB;    0.570 cum overhead MiB
class  32 [     1152 bytes ] :     3261 objs;   3.6 MiB;  34.8 cum MiB;    0.059 overhead MiB;    0.629 cum overhead MiB
class  33 [     1280 bytes ] :        3 objs;   0.0 MiB;  34.8 cum MiB;    0.002 overhead MiB;    0.631 cum overhead MiB
class  34 [     1408 bytes ] :     1362 objs;   1.8 MiB;  36.6 cum MiB;    0.106 overhead MiB;    0.737 cum overhead MiB
class  35 [     1536 bytes ] :     1106 objs;   1.6 MiB;  38.2 cum MiB;    0.110 overhead MiB;    0.847 cum overhead MiB
class  36 [     1792 bytes ] :        2 objs;   0.0 MiB;  38.2 cum MiB;    0.000 overhead MiB;    0.848 cum overhead MiB
class  37 [     2048 bytes ] :        4 objs;   0.0 MiB;  38.2 cum MiB;    0.000 overhead MiB;    0.848 cum overhead MiB
class  38 [     2304 bytes ] :        5 objs;   0.0 MiB;  38.2 cum MiB;    0.040 overhead MiB;    0.888 cum overhead MiB
class  39 [     2560 bytes ] :        7 objs;   0.0 MiB;  38.2 cum MiB;    0.008 overhead MiB;    0.895 cum overhead MiB
class  40 [     2816 bytes ] :        6 objs;   0.0 MiB;  38.3 cum MiB;    0.002 overhead MiB;    0.897 cum overhead MiB
class  41 [     3072 bytes ] :        5 objs;   0.0 MiB;  38.3 cum MiB;    0.004 overhead MiB;    0.901 cum overhead MiB
class  42 [     3328 bytes ] :        6 objs;   0.0 MiB;  38.3 cum MiB;    0.002 overhead MiB;    0.904 cum overhead MiB
class  43 [     4096 bytes ] :      759 objs;   3.0 MiB;  41.3 cum MiB;    0.000 overhead MiB;    0.904 cum overhead MiB
class  44 [     4608 bytes ] :        4 objs;   0.0 MiB;  41.3 cum MiB;    0.001 overhead MiB;    0.905 cum overhead MiB
class  45 [     5120 bytes ] :        3 objs;   0.0 MiB;  41.3 cum MiB;    0.001 overhead MiB;    0.906 cum overhead MiB
class  46 [     6144 bytes ] :        5 objs;   0.0 MiB;  41.3 cum MiB;    0.000 overhead MiB;    0.906 cum overhead MiB
class  47 [     6656 bytes ] :        6 objs;   0.0 MiB;  41.4 cum MiB;    0.001 overhead MiB;    0.907 cum overhead MiB
class  48 [     8192 bytes ] :        5 objs;   0.0 MiB;  41.4 cum MiB;    0.000 overhead MiB;    0.907 cum overhead MiB
class  49 [     9216 bytes ] :        7 objs;   0.1 MiB;  41.5 cum MiB;    0.008 overhead MiB;    0.915 cum overhead MiB
class  51 [    12288 bytes ] :        2 objs;   0.0 MiB;  41.5 cum MiB;    0.000 overhead MiB;    0.915 cum overhead MiB
class  52 [    13312 bytes ] :        1 objs;   0.0 MiB;  41.5 cum MiB;    0.001 overhead MiB;    0.916 cum overhead MiB
class  53 [    16384 bytes ] :        8 objs;   0.1 MiB;  41.6 cum MiB;    0.000 overhead MiB;    0.916 cum overhead MiB
class  54 [    20480 bytes ] :        3 objs;   0.1 MiB;  41.7 cum MiB;    0.000 overhead MiB;    0.916 cum overhead MiB
class  56 [    26624 bytes ] :        1 objs;   0.0 MiB;  41.7 cum MiB;    0.004 overhead MiB;    0.920 cum overhead MiB
------------------------------------------------
PageHeap: 7 sizes;    0.0 MiB free;    4.0 MiB unmapped
------------------------------------------------
     1 pages *    243 spans ~    1.9 MiB;    1.9 MiB cum; unmapped:    1.9 MiB;    1.9 MiB cum
     2 pages *     74 spans ~    1.2 MiB;    3.1 MiB cum; unmapped:    1.2 MiB;    3.1 MiB cum
     3 pages *     12 spans ~    0.3 MiB;    3.3 MiB cum; unmapped:    0.3 MiB;    3.3 MiB cum
     4 pages *      3 spans ~    0.1 MiB;    3.4 MiB cum; unmapped:    0.1 MiB;    3.4 MiB cum
     5 pages *      2 spans ~    0.1 MiB;    3.5 MiB cum; unmapped:    0.1 MiB;    3.5 MiB cum
     7 pages *      1 spans ~    0.1 MiB;    3.6 MiB cum; unmapped:    0.1 MiB;    3.6 MiB cum
    57 pages *      1 spans ~    0.4 MiB;    4.0 MiB cum; unmapped:    0.4 MiB;    4.0 MiB cum
>128   large *      0 spans ~    0.0 MiB;    4.0 MiB cum; unmapped:    0.0 MiB;    4.0 MiB cum

caoyukun0430, Aug 24 '22 18:08

Unfortunately, nothing stands out to me with the fragmentation here :(.

Here are some approaches I think we could take:

  1. Try using the gperftools "TCMALLOC_HEAP_LIMIT_MB" env var to adjust the allocator's behavior. Not sure it'll solve the issue, as I'm not a gperftools TCMalloc expert. See https://github.com/gperftools/gperftools/issues/451 and the change https://github.com/gperftools/gperftools/commit/6f6c2bf68576e43491884364fc05c9ed752adb4f#diff-1add31c814fb4eaf879e2e014ba2a6bffac60e67f2abeca2e106ba493d121e7bR53
  2. Allocate a higher ceiling for the task so that TCMalloc freelist fragmentation doesn't easily end up freezing the job.
  3. Turn off the overload manager for small tasks like this; I think this is Istio's approach, relying on the task getting killed if it goes over its limit.
  4. Discount the TCMalloc freelists, but that memory is still actually consumed and would be counted by the systems that kill the task, so I don't think that's the right approach.

KBaichoo, Aug 29 '22 20:08

Unfortunately, nothing stands out to me with the fragmentation here :(. Yukun: I am curious how to read the output; could you explain a bit?

Here are some approaches I think we could take: Yukun: I would like to focus more on whether it is possible to release more free memory (typically central_cache_free_bytes and transfer_cache_free_bytes, given their large increase), as I stated in my last comment: "All the other free memory cannot be easily returned to the OS due pieces of that chunk of memory is in use". So you think the memory released from allocated into central_cache_free_bytes and transfer_cache_free_bytes will stay there and will not be returned to the OS because it is in use, and therefore it should not be counted as "free mem regions" by the heap monitor? Then I would ask: is it the case that only memory that can be returned to the OS is discounted by the heap monitor, while memory in the caches is not? (I thought the memory in the caches would be reused by traffic first, and the OS would only allocate more if the caches run out, so to me it should count as a free memory region.) What is your opinion?

  1. Try using the gperftools "TCMALLOC_HEAP_LIMIT_MB" env var to adjust the allocator's behavior. Not sure it'll solve the issue, as I'm not a gperftools TCMalloc expert. max memory setting gperftools/gperftools#451 and change gperftools/gperftools@6f6c2bf#diff-1add31c814fb4eaf879e2e014ba2a6bffac60e67f2abeca2e106ba493d121e7bR53 Yukun: From the link, TCMALLOC_HEAP_LIMIT_MB is "the maximum amount of heap memory that tcmalloc can use", so to me it is just a different way of setting "max_heap_size_bytes" and not what I would focus on.
  2. Allocate a higher ceiling for the task so that TCMalloc freelist fragmentation doesn't easily end up freezing the job.
  3. Turn off the overload manager for small tasks like this; I think this is Istio's approach, relying on the task getting killed if it goes over its limit.
  4. Discount the TCMalloc freelists, but that memory is still actually consumed and would be counted by the systems that kill the task, so I don't think that's the right approach.

Yukun: Overall, the approaches you mentioned would be useful, but as I said, I would like to focus on whether it is possible to release more free memory, typically the bytes freed from allocated into central_cache_free_bytes and transfer_cache_free_bytes. What is your opinion on counting (part of) them as "free memory regions" that could be released? Or does the data above prove that they cannot be released?

caoyukun0430, Aug 31 '22 18:08

TCMalloc cannot release them as-is; they are stuck in the freelists and still count against the memory used by the process.
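
For what it's worth, my understanding is that the fixed_heap monitor derives its pressure from the overall heap size minus the unmapped bytes (and, in newer Envoy versions, minus the free page-heap bytes), so bytes parked in the central, transfer, and thread cache freelists still count as used. Below is a minimal sketch to inspect that value from outside the pod; it assumes the Envoy admin endpoint listens on localhost:15000 inside the gateway container, that curl is present in the image, and that jq is installed locally, and it uses the 30000000-byte max_heap_size_bytes from this setup.

# Approximate the fraction the fixed_heap monitor compares against its thresholds.
kubectl exec deploy/istio-ingressgateway -n istio-system -- \
  curl -s http://localhost:15000/memory \
  | jq '((.heap_size|tonumber) - (.pageheap_unmapped|tonumber) - (.pageheap_free|tonumber)) / 30000000'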

KBaichoo, Aug 31 '22 20:08

Hi @KBaichoo, sorry for the late reply. After some discussion, I think there is another approach worth trying out, basically what you mentioned before: "You might get some mileage from setting the env var TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to be lower to reduce how big the thread caches get. It's a tradeoff of cpu for memory, but might be fine for a smaller deployment." Since we saw memory being freed from allocated into central_cache_free_bytes and transfer_cache_free_bytes, the idea would be to limit the maximum size of these two caches to smaller values so that memory is forced back to the OS. I agree it's a trade-off, but it is worth trying.
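
One way to experiment with this on the gateway Deployment is sketched below. It assumes kubectl access; whether the variable has any effect depends on whether the Envoy build in use is linked against gperftools TCMalloc and reads this env var, and an Istio upgrade or operator reconcile may revert a manual Deployment change. The 10000000 value is only an example.

# Sketch: cap the gperftools thread caches on the gateway proxy and roll the pods.
kubectl set env deployment/istio-ingressgateway -n istio-system \
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=10000000
kubectl rollout status deployment/istio-ingressgateway -n istio-system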

So I would like to ask

  1. What is the correct way to set the env var TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES given that I use Istio? Can I do it as in https://github.com/istio/istio/issues/29395, i.e. --set meshConfig.defaultConfig.proxyMetadata.TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=10000000?
  2. Are there env vars for central_cache_free_bytes and transfer_cache_free_bytes? Is it possible to set a max value for them?

caoyukun0430, Sep 06 '22 13:09

What is the correct way to set the env var TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES given that I use Istio? Can I do it as in https://github.com/istio/istio/issues/29395, i.e. --set meshConfig.defaultConfig.proxyMetadata.TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=10000000?

No idea, I don't use Istio.

Are there env vars for central_cache_free_bytes and transfer_cache_free_bytes? Is it possible to set a max value for them?

Not direct knobs AFAIK. Perhaps using the gperftools "TCMALLOC_HEAP_LIMIT_MB" env var to adjust the allocator's behavior would influence those. This is probably a better-suited question for https://github.com/gperftools/gperftools

KBaichoo, Sep 06 '22 22:09

@KBaichoo I tried setting these two env vars directly in the env fields of my ingressgateway pod and tested again, but the results didn't change, so I am afraid my setup is not correct. How would you set them for Envoy, and how can I verify whether the env vars are actually picked up by Envoy? (I checked the Envoy code and didn't find anything related to TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, so I am wondering how this env var is consumed by Envoy.)
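
A quick way to confirm whether the variable actually reaches the envoy process is sketched below. It assumes a shell and pidof are available in the proxy image; if pidof is missing, substitute the envoy PID taken from ps inside the pod.

# Environment of the container as a whole:
kubectl exec deploy/istio-ingressgateway -n istio-system -- env | grep TCMALLOC
# Environment of the running envoy process itself:
kubectl exec deploy/istio-ingressgateway -n istio-system -- \
  sh -c 'tr "\0" "\n" < /proc/$(pidof envoy)/environ | grep TCMALLOC'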

Second, my tests have mainly covered small memory usage, O(50 MB) to O(100 MB). Would the memory release behavior be more aggressive, i.e. would more memory be returned to the OS, if much more memory were consumed, e.g. O(1 GB)? Would you or any colleague know more?

caoyukun0430, Sep 13 '22 07:09