
Envoy does not release memory when upstream is websocket server

ningyougang opened this issue 1 year ago · 6 comments

Description

  • Envoy does not release memory when the upstream is a websocket server
    • The problem does not occur if the upstream is an HTTP server

Repro steps

  • Start an Envoy container acting as a proxy that points to a websocket server

    docker run -v /envoy-websocket.yaml:/etc/envoy/envoy.yaml  -d  --name envoy-websocket-container -p 9901:9901 -p 10000:10000 \
    envoyproxy/envoy:v1.28-latest  -c /etc/envoy/envoy.yaml
    
  • config(/envoy-websocket.yaml)

    $ cat /envoy-websocket.yaml
    admin:
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 9901
    
    static_resources:
      listeners:
      - address:
          socket_address:
            address: 0.0.0.0
            port_value: 10000
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_ws_to_ws
              upgrade_configs:
              - upgrade_type: websocket
              route_config:
                name: local_route
                virtual_hosts:
                - name: app
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      cluster: service_ws
              http_filters:
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
    
      clusters:
      - name: service_ws
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: service_ws
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: xxx.xxx.xxx.xxx  # this is my local websocket server address
                    port_value: 80
    
    

    FYI, I referred to this article: https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/websocket

  • Send requests to the Envoy address using the client code below

    Client code:
    
    package main

    import (
      "fmt"
      "strconv"
      "strings"
      "sync"
      "time"

      "github.com/gorilla/websocket"
    )

    func main() {
      // raw envoy: target address and test parameters
      serverURL := "ws://$myEnvoyAddress:10000"
      connectionCount := "500"
      durationSeconds := "300"
      interval := "500"
      payloadSize := "16"

      // Parse durationSeconds from string to int
      duration, err := time.ParseDuration(durationSeconds + "s")
      if err != nil {
      	fmt.Println("Invalid duration provided:", err)
      	return
      }
    
      // Convert connectionCount from string to int
      connCount, err := strconv.Atoi(connectionCount)
      if err != nil {
      	fmt.Println("Invalid connection count provided:", err)
      	return
      }
    
      sz, err := strconv.Atoi(payloadSize)
      if err != nil {
      	fmt.Println("Invalid connection count provided:", err)
      	return
      }
    
      payload := strings.Repeat("a", sz*1024) // if sz==16, set 16KB payload
    
      fmt.Println("Connecting to server:", serverURL)
    
      // Create a wait group to wait for all goroutines to finish
      var wg sync.WaitGroup
    
      // Store latency values for later calculation
      var latencies []time.Duration
      totalConnCount := 0
    
      errorCount := 0
      uniqueErrors := make(map[string]int)
      errorMutex := sync.Mutex{}
    
      // Perform connections and communication
      for i := 0; i < connCount; i++ {
      	wg.Add(1)
      	go func() {
      		defer wg.Done()
    
      		conn, _, err := websocket.DefaultDialer.Dial(serverURL, nil)
      		if err != nil {
      			fmt.Println("WebSocket connection error:", err)
      			errorMutex.Lock()
      			if strings.Contains(err.Error(), "read: connection reset by peer") {
      				uniqueErrors["read: connection reset by peer"]++
      			} else if strings.Contains(err.Error(), "write: broken pipe") {
      				uniqueErrors["write: broken pipe"]++
      			} else {
      				uniqueErrors[err.Error()]++
      			}
      			errorCount++
      			errorMutex.Unlock()
      			return
      		}
      		errorMutex.Lock()
      		totalConnCount++ // guard the shared counter against concurrent updates
      		errorMutex.Unlock()
      		defer conn.Close()
    
      		// Send and receive messages for the specified duration
      		startTime := time.Now()
    
      		for time.Since(startTime) < duration {
      			// Send message
      			sendTime := time.Now()
      			err = conn.WriteMessage(websocket.TextMessage, []byte(payload))
      			if err != nil {
      				fmt.Println("Message send error:", err)
      				return
      			}
    
      			// Receive response with timeout
      			_, response, err := conn.ReadMessage()
      			if err != nil {
      				fmt.Println("Message receive error:", err)
      				errorMutex.Lock()
      				if strings.Contains(err.Error(), "read: connection reset by peer") {
      					uniqueErrors["read: connection reset by peer"]++
      				} else if strings.Contains(err.Error(), "write: broken pipe") {
      					uniqueErrors["write: broken pipe"]++
      				} else {
      					uniqueErrors[err.Error()]++
      				}
      				errorCount++
      				errorMutex.Unlock()
      				return
      			}
    
      			messageSize := len(response)
      			sizeString := fmt.Sprintf("%d Byte", messageSize)
      			if messageSize >= 1024 {
      				sizeString = fmt.Sprintf("%.2f KB", float64(messageSize)/1024)
      			}
    
      			// Calculate and store latency
      			endTime := time.Now()
      			latency := endTime.Sub(sendTime)
      			errorMutex.Lock()
      			latencies = append(latencies, latency) // guard the shared slice against concurrent appends
      			errorMutex.Unlock()
    
      			fmt.Printf("%d/%d %d:%d:%d Received response, size: %s\n", endTime.Month(), endTime.Day(), endTime.Hour(), endTime.Minute(), endTime.Second(), sizeString)
    
      			// Wait for 0.5 seconds before sending the next message
      			iint, err := strconv.Atoi(interval)
      			if err != nil {
      				fmt.Println("Convert str to int error:", err)
      			}
      			time.Sleep(time.Duration(iint) * time.Millisecond)
      		}
      	}()
      }
    
      // Wait for all goroutines to finish
      wg.Wait()
    
      // Calculate and display average latency
      totalLatency := time.Duration(0)
      for _, latency := range latencies {
      	totalLatency += latency
      }
      //totalCount := len(latencies)
      // Avoid dividing by zero if every connection failed
      if len(latencies) > 0 {
      	averageLatency := totalLatency / time.Duration(len(latencies))
      	fmt.Println("Average latency:", averageLatency)
      }
      fmt.Println("Total send messages:", len(latencies))
      fmt.Println("Total connections:", totalConnCount)
      // Print unique error messages and count
      fmt.Println("Total errors:", errorCount)
      for errMsg := range uniqueErrors {
      	fmt.Println("Error:", errMsg, "(", uniqueErrors[errMsg], ")")
      }
    

    }
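
For reference, a rough sketch of how to fetch the dependency and run the client above (the module name wsclient and file name client.go are just placeholders, not part of the original report):

    # assumes the code above is saved as client.go
    go mod init wsclient                  # create a throwaway module
    go get github.com/gorilla/websocket   # fetch the websocket client library
    go run client.go                      # run the load test against Envoy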

Check memory using the ctop tool

  • Before sending requests to Envoy

    [image] We can see the memory is 16M initially.

  • After sending requests to Envoy

    [image] We can see the memory is now 39M. I waited at least 30 minutes and the memory stayed at 39M.
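
Besides ctop (which reports the container's RSS), Envoy's admin interface can report the allocator's own view of the heap. A quick cross-check, assuming the admin listener on port 9901 from the config above:

    # Ask Envoy (tcmalloc) how much of the heap is actually allocated vs. cached
    curl -s localhost:9901/memory

Comparing fields such as allocated and heap_size there should show whether the extra memory is held by live objects or is just cached by the allocator after the connections close.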

My question

  • Is this a bug in Envoy itself? (I used Envoy's latest image to test.)
  • If not, what did I miss?

ningyougang · Jan 12 '24

cc @alyssawilk

phlax · Jan 12 '24

I'd take a look at your stats and see if the connection is considered closed by Envoy or not.

alyssawilk · Jan 16 '24

@alyssawilk

downstream_cx_upgrades_active metric (i.e. "Total active upgraded connections")

  • Before running the client code above

    # curl localhost:9901/stats|grep downstream_cx_upgrades_active
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21534    0 21534  http.admin.downstream_cx_upgrades_active: 0
    http.ingress_ws_to_ws.downstream_cx_upgrades_active: 0
    
  • Check the downstream_cx_upgrades_active metric while the client code is running

    # curl localhost:9901/stats|grep downstream_cx_upgrades_active
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21534    0 21534    0     0  3315k      0 --:--:-- --:--:-- --:--:-- 3504k
    http.admin.downstream_cx_upgrades_active: 0
    http.ingress_ws_to_ws.downstream_cx_upgrades_active: 500
    

    http.ingress_ws_to_ws.downstream_cx_upgrades_active is 500

  • Check the downstream_cx_upgrades_active metric right after the client code finishes (checked almost immediately)

    # curl localhost:9901/stats|grep downstream_cx_upgrades_active
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21477    0 21477    0     0  3023k      0 --:--:-- --:--:-- --:--:-- 3495k
    http.admin.downstream_cx_upgrades_active: 0
    http.ingress_ws_to_ws.downstream_cx_upgrades_active: 0
    

    We can see the websocket connections are released almost immediately.

downstream_cx_destroy metric (i.e. "Total connections destroyed")

  • Check downstream_cx_destroy after each run of the client code finishes
    # curl localhost:9901/stats|grep downstream_cx_destroy
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 21534    0 21534    0     0  3349k      0 --:--:-- --:--:-- --:--:-- 3504k
    http.admin.downstream_cx_destroy: 36
    http.admin.downstream_cx_destroy_active_rq: 0
    http.admin.downstream_cx_destroy_local: 0
    http.admin.downstream_cx_destroy_local_active_rq: 0
    http.admin.downstream_cx_destroy_remote: 36
    http.admin.downstream_cx_destroy_remote_active_rq: 0
    http.ingress_ws_to_ws.downstream_cx_destroy: 2000
    http.ingress_ws_to_ws.downstream_cx_destroy_active_rq: 2000
    http.ingress_ws_to_ws.downstream_cx_destroy_local: 0
    http.ingress_ws_to_ws.downstream_cx_destroy_local_active_rq: 0
    http.ingress_ws_to_ws.downstream_cx_destroy_remote: 2000
    http.ingress_ws_to_ws.downstream_cx_destroy_remote_active_rq: 2000
    listener.0.0.0.0_10000.downstream_cx_destroy: 2000
    listener.admin.downstream_cx_destroy: 36
    
    The http.ingress_ws_to_ws.downstream_cx_destroy value increases by 500 after each run, e.g.:
    • 500: after the first run finished
    • 1000: after the second run finished
    • 1500: after the third run finished
    • 2000: after the fourth run finished
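
As a side note, instead of grepping the full stats dump, the admin stats endpoint can filter server-side (filter takes a regex, and usedonly hides counters that were never touched); a small sketch:

    # Return only the active-upgrade / connection-destroy stats, without the curl progress meter
    curl -s "localhost:9901/stats?usedonly&filter=downstream_cx_(upgrades_active|destroy)"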

Conclusion

It seems the websocket connections are closed after each run of the client code finishes.

ningyougang · Jan 23 '24

Some good news: I ran the client code above several times and found the following:

  • The memory stays at a stable value (e.g. 42M~43M) from the second run of the client code onward. [image]
    • So I guess the increased memory (42M - 12M = 30M) is initial resources allocated for websocket handling
      • But I'm not sure
  • The websocket connections are released immediately after each run of the client code finishes
    • http.ingress_ws_to_ws.downstream_cx_upgrades_active is 0 after each run

If there were some tool to find out which C++ objects account for the increased memory (42M - 12M = 30M), that would be even better.

ningyougang · Jan 23 '24

Let me make a simple summary now

  • The memory increase does exist.
  • After the second (and subsequent) tests, memory remains at a relatively stable value.
  • The increased memory is never released, even though all websocket connections are closed.
    • I think this increased memory is the initial resources for websocket handling (just a guess).
    • It would be best if I could find the C++ objects behind the increased memory.
      • Is there any technique to analyze this in depth? (A possible approach is sketched below.)
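
One possible technique, assuming the Envoy build includes gperftools/tcmalloc profiling support (the stock release image may not, in which case the endpoint just returns an error), is to drive the admin heap profiler around a test run and inspect the resulting profile with pprof. A rough sketch:

    # Enable heap profiling before starting the websocket client
    curl -s -X POST "localhost:9901/heapprofiler?enable=y"

    # ... run the client code above and wait for it to finish ...

    # Disable profiling; the dumped heap profile can then be analyzed with pprof
    curl -s -X POST "localhost:9901/heapprofiler?enable=n"

    # The admin /memory endpoint is also worth comparing before/after:
    # 'allocated' is memory held by live C++ objects, while the rest of
    # 'heap_size' is memory tcmalloc keeps cached rather than returning to
    # the OS, which could look like the plateau described above
    curl -s localhost:9901/memory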

ningyougang · Jan 24 '24

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] · Feb 23 '24

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] · Mar 01 '24