
Envoy does not release memory when upstream is websocket server

ningyougang opened this issue 1 year ago · 6 comments

Description

  • Envoy does not release memory when the upstream is a websocket server
    • The problem does not occur if the upstream is an HTTP server

Repro steps

  • Start an Envoy container acting as a proxy that points to a websocket server

    docker run -v /envoy-websocket.yaml:/etc/envoy/envoy.yaml  -d  --name envoy-websocket-container -p 9901:9901 -p 10000:10000 \
    envoyproxy/envoy:v1.28-latest  -c /etc/envoy/envoy.yaml
    
  • config(/envoy-websocket.yaml)

    $ cat /envoy-websocket.yaml
    admin:
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 9901
    
    static_resources:
      listeners:
      - address:
          socket_address:
            address: 0.0.0.0
            port_value: 10000
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_ws_to_ws
              upgrade_configs:
              - upgrade_type: websocket
              route_config:
                name: local_route
                virtual_hosts:
                - name: app
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      cluster: service_ws
              http_filters:
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
    
      clusters:
      - name: service_ws
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: service_ws
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: xxx.xxx.xxx.xxx  # this is my local websocket server address
                    port_value: 80
    
    

    FYI, I referred to this article: https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/websocket

  • Send requests to the Envoy address using the client code below

    Client code:
    
    package main

    import (
      "fmt"
      "strconv"
      "strings"
      "sync"
      "time"

      "github.com/gorilla/websocket"
    )

    func main() {
      // raw envoy: target address and test parameters
      serverURL := "ws://$myEnvoyAddress:10000"
      connectionCount := "500"
      durationSeconds := "300"
      interval := "500"
      payloadSize := "16"

      // Parse durationSeconds from string to int
      duration, err := time.ParseDuration(durationSeconds + "s")
      if err != nil {
      	fmt.Println("Invalid duration provided:", err)
      	return
      }
    
      // Convert connectionCount from string to int
      connCount, err := strconv.Atoi(connectionCount)
      if err != nil {
      	fmt.Println("Invalid connection count provided:", err)
      	return
      }
    
      sz, err := strconv.Atoi(payloadSize)
      if err != nil {
      	fmt.Println("Invalid connection count provided:", err)
      	return
      }
    
      payload := strings.Repeat("a", sz*1024) // if sz==16, set 16KB payload
    
      fmt.Println("Connecting to server:", serverURL)
    
      // Create a wait group to wait for all goroutines to finish
      var wg sync.WaitGroup
    
      // Store latency values for later calculation
      var latencies []time.Duration
      totalConnCount := 0
    
      errorCount := 0
      uniqueErrors := make(map[string]int)
      errorMutex := sync.Mutex{}
    
      // Perform connections and communication
      for i := 0; i < connCount; i++ {
      	wg.Add(1)
      	go func() {
      		defer wg.Done()
    
      		conn, _, err := websocket.DefaultDialer.Dial(serverURL, nil)
      		if err != nil {
      			fmt.Println("WebSocket connection error:", err)
      			errorMutex.Lock()
      			if strings.Contains(err.Error(), "read: connection reset by peer") {
      				uniqueErrors["read: connection reset by peer"]++
      			} else if strings.Contains(err.Error(), "write: broken pipe") {
      				uniqueErrors["write: broken pipe"]++
      			} else {
      				uniqueErrors[err.Error()]++
      			}
      			errorCount++
      			errorMutex.Unlock()
      			return
      		}
      		errorMutex.Lock()
      		totalConnCount++ // guard the shared counter against concurrent updates
      		errorMutex.Unlock()
      		defer conn.Close()
    
      		// Send and receive messages for the specified duration
      		startTime := time.Now()
    
      		for time.Since(startTime) < duration {
      			// Send message
      			sendTime := time.Now()
      			err = conn.WriteMessage(websocket.TextMessage, []byte(payload))
      			if err != nil {
      				fmt.Println("Message send error:", err)
      				return
      			}
    
      			// Receive response with timeout
      			_, response, err := conn.ReadMessage()
      			if err != nil {
      				fmt.Println("Message receive error:", err)
      				errorMutex.Lock()
      				if strings.Contains(err.Error(), "read: connection reset by peer") {
      					uniqueErrors["read: connection reset by peer"]++
      				} else if strings.Contains(err.Error(), "write: broken pipe") {
      					uniqueErrors["write: broken pipe"]++
      				} else {
      					uniqueErrors[err.Error()]++
      				}
      				errorCount++
      				errorMutex.Unlock()
      				return
      			}
    
      			messageSize := len(response)
      			sizeString := fmt.Sprintf("%d Byte", messageSize)
      			if messageSize >= 1024 {
      				sizeString = fmt.Sprintf("%.2f KB", float64(messageSize)/1024)
      			}
    
      			// Calculate and store latency
      			endTime := time.Now()
      			latency := endTime.Sub(sendTime)
      			errorMutex.Lock()
      			latencies = append(latencies, latency) // guard the shared slice against concurrent appends
      			errorMutex.Unlock()
    
      			fmt.Printf("%d/%d %d:%d:%d Received response, size: %s\n", endTime.Month(), endTime.Day(), endTime.Hour(), endTime.Minute(), endTime.Second(), sizeString)
    
      			// Wait for 0.5 seconds before sending the next message
      			iint, err := strconv.Atoi(interval)
      			if err != nil {
      				fmt.Println("Convert str to int error:", err)
      			}
      			time.Sleep(time.Duration(iint) * time.Millisecond)
      		}
      	}()
      }
    
      // Wait for all goroutines to finish
      wg.Wait()
    
      // Calculate and display average latency
      totalLatency := time.Duration(0)
      for _, latency := range latencies {
      	totalLatency += latency
      }
      //totalCount := len(latencies)
      // Avoid dividing by zero if every connection failed
      if len(latencies) > 0 {
      	averageLatency := totalLatency / time.Duration(len(latencies))
      	fmt.Println("Average latency:", averageLatency)
      }
      fmt.Println("Total send messages:", len(latencies))
      fmt.Println("Total connections:", totalConnCount)
      // Print unique error messages and count
      fmt.Println("Total errors:", errorCount)
      for errMsg := range uniqueErrors {
      	fmt.Println("Error:", errMsg, "(", uniqueErrors[errMsg], ")")
      }
    

    }
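
For reference, a rough sketch of how to fetch the dependency and run the client above (the module name wsclient and file name client.go are just placeholders, not part of the original report):

    # assumes the code above is saved as client.go
    go mod init wsclient                  # create a throwaway module
    go get github.com/gorilla/websocket   # fetch the websocket client library
    go run client.go                      # run the load test against Envoy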

Check memory using the ctop tool

  • Before sending requests to Envoy

    [image] We can see the memory is 16M initially.

  • After sending requests to Envoy

    [image] We can see the memory is now 39M. I waited at least 30 minutes and the memory stayed at 39M.
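
Besides ctop (which reports the container's RSS), Envoy's admin interface can report the allocator's own view of the heap. A quick cross-check, assuming the admin listener on port 9901 from the config above:

    # Ask Envoy (tcmalloc) how much of the heap is actually allocated vs. cached
    curl -s localhost:9901/memory

Comparing fields such as allocated and heap_size there should show whether the extra memory is held by live objects or is just cached by the allocator after the connections close.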

My question

  • Is this a bug in Envoy itself? (I used Envoy's latest image to test.)
  • If not, what did I miss?

ningyougang · Jan 12 '24

cc @alyssawilk

phlax · Jan 12 '24

I'd take a look at your stats and see if the connection is considered closed by Envoy or not.

alyssawilk · Jan 16 '24

@alyssawilk

downstream_cx_upgrades_active metric (i.e. "Total active upgraded connections")

  • Before running the client code above

    # curl localhost:9901/stats|grep downstream_cx_upgrades_active
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21534    0 21534  http.admin.downstream_cx_upgrades_active: 0
    http.ingress_ws_to_ws.downstream_cx_upgrades_active: 0
    
  • Check the downstream_cx_upgrades_active metric while the client code is running

    # curl localhost:9901/stats|grep downstream_cx_upgrades_active
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21534    0 21534    0     0  3315k      0 --:--:-- --:--:-- --:--:-- 3504k
    http.admin.downstream_cx_upgrades_active: 0
    http.ingress_ws_to_ws.downstream_cx_upgrades_active: 500
    

    http.ingress_ws_to_ws.downstream_cx_upgrades_active is 500

  • Check the downstream_cx_upgrades_active metric right after the client code finishes (checked almost immediately)

    # curl localhost:9901/stats|grep downstream_cx_upgrades_active
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21477    0 21477    0     0  3023k      0 --:--:-- --:--:-- --:--:-- 3495k
    http.admin.downstream_cx_upgrades_active: 0
    http.ingress_ws_to_ws.downstream_cx_upgrades_active: 0
    

    We can see the websocket connections are released almost immediately.

downstream_cx_destroy metric (i.e. "Total connections destroyed")

  • Check downstream_cx_destroy after each run of the client code finishes
    # curl localhost:9901/stats|grep downstream_cx_destroy
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21534    0 21534    0     0  3349k      0 --:--:-- --:--:-- --:--:-- 3504k
    http.admin.downstream_cx_destroy: 36
    http.admin.downstream_cx_destroy_active_rq: 0
    http.admin.downstream_cx_destroy_local: 0
    http.admin.downstream_cx_destroy_local_active_rq: 0
    http.admin.downstream_cx_destroy_remote: 36
    http.admin.downstream_cx_destroy_remote_active_rq: 0
    http.ingress_ws_to_ws.downstream_cx_destroy: 2000
    http.ingress_ws_to_ws.downstream_cx_destroy_active_rq: 2000
    http.ingress_ws_to_ws.downstream_cx_destroy_local: 0
    http.ingress_ws_to_ws.downstream_cx_destroy_local_active_rq: 0
    http.ingress_ws_to_ws.downstream_cx_destroy_remote: 2000
    http.ingress_ws_to_ws.downstream_cx_destroy_remote_active_rq: 2000
    listener.0.0.0.0_10000.downstream_cx_destroy: 2000
    listener.admin.downstream_cx_destroy: 36
    
    The http.ingress_ws_to_ws.downstream_cx_destroy value increases by 500 after each run, e.g.:
    • 500: after the first run finished
    • 1000: after the second run finished
    • 1500: after the third run finished
    • 2000: after the fourth run finished
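
As a side note, instead of grepping the full stats dump, the admin stats endpoint can filter server-side (filter takes a regex, and usedonly hides counters that were never touched); a small sketch:

    # Return only the active-upgrade / connection-destroy stats, without the curl progress meter
    curl -s "localhost:9901/stats?usedonly&filter=downstream_cx_(upgrades_active|destroy)"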

Conclusion

It seems the websocket connections are closed after each run of the client code finishes.

ningyougang · Jan 23 '24

Some good news: I ran the client code above several times and found the following:

  • The memory stays at a stable value (e.g. 42M~43M) from the second run of the client code onward. [image]
    • So I guess the increased memory (42M - 12M = 30M) is initial resources allocated for websocket handling
      • But I'm not sure
  • The websocket connections are released immediately after each run of the client code finishes
    • http.ingress_ws_to_ws.downstream_cx_upgrades_active is 0 after each run

If there were some tool to find out which C++ objects account for the increased memory (42M - 12M = 30M), that would be even better.

ningyougang · Jan 23 '24

Let me make a simple summary now

  • The memory increase does exist.
  • After the second (and subsequent) tests, memory remains at a relatively stable value.
  • The increased memory is never released, even though all websocket connections are closed.
    • I think this increased memory is the initial resources for websocket handling (just a guess).
    • It would be best if I could find the C++ objects behind the increased memory.
      • Is there any technique to analyze this in depth? (A possible approach is sketched below.)
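
One possible technique, assuming the Envoy build includes gperftools/tcmalloc profiling support (the stock release image may not, in which case the endpoint just returns an error), is to drive the admin heap profiler around a test run and inspect the resulting profile with pprof. A rough sketch:

    # Enable heap profiling before starting the websocket client
    curl -s -X POST "localhost:9901/heapprofiler?enable=y"

    # ... run the client code above and wait for it to finish ...

    # Disable profiling; the dumped heap profile can then be analyzed with pprof
    curl -s -X POST "localhost:9901/heapprofiler?enable=n"

    # The admin /memory endpoint is also worth comparing before/after:
    # 'allocated' is memory held by live C++ objects, while the rest of
    # 'heap_size' is memory tcmalloc keeps cached rather than returning to
    # the OS, which could look like the plateau described above
    curl -s localhost:9901/memory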

ningyougang · Jan 24 '24

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] · Feb 23 '24

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] · Mar 01 '24