Envoy does not release memory when upstream is websocket server
Description
- Envoy does not release memory when the upstream is a websocket server.
- The problem does not exist if the upstream is an HTTP server.
Repro steps
- Start an Envoy container acting as a proxy pointing to the websocket server:

```shell
docker run -v /envoy-websocket.yaml:/etc/envoy/envoy.yaml -d --name envoy-websocket-container \
  -p 9901:9901 -p 10000:10000 \
  envoyproxy/envoy:v1.28-latest -c /etc/envoy/envoy.yaml
```

- Config (`/envoy-websocket.yaml`):

```yaml
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
static_resources:
  listeners:
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_ws_to_ws
          upgrade_configs:
          - upgrade_type: websocket
          route_config:
            name: local_route
            virtual_hosts:
            - name: app
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: service_ws
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: service_ws
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: service_ws
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xxx.xxx.xxx.xxx  # this is my local websocket server address
                port_value: 80
```

FYI, I referred to this article: https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/websocket
- Send requests to the Envoy address using the client code below:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	// raw envoy
	serverURL := "ws://$myEnvoyAddress:10000"
	connectionCount := "500"
	durationSeconds := "300"
	interval := "500"
	payloadSize := "16"

	// Parse durationSeconds into a time.Duration
	duration, err := time.ParseDuration(durationSeconds + "s")
	if err != nil {
		fmt.Println("Invalid duration provided:", err)
		return
	}
	// Convert connectionCount from string to int
	connCount, err := strconv.Atoi(connectionCount)
	if err != nil {
		fmt.Println("Invalid connection count provided:", err)
		return
	}
	sz, err := strconv.Atoi(payloadSize)
	if err != nil {
		fmt.Println("Invalid payload size provided:", err)
		return
	}
	payload := strings.Repeat("a", sz*1024) // if sz==16, send a 16KB payload
	iint, err := strconv.Atoi(interval)
	if err != nil {
		fmt.Println("Invalid interval provided:", err)
		return
	}

	fmt.Println("Connecting to server:", serverURL)

	// Wait group to wait for all goroutines to finish
	var wg sync.WaitGroup

	// Shared state, guarded by mutexes since goroutines update it concurrently
	var latencies []time.Duration
	totalConnCount := 0
	errorCount := 0
	uniqueErrors := make(map[string]int)
	statsMutex := sync.Mutex{}
	errorMutex := sync.Mutex{}

	recordError := func(err error) {
		errorMutex.Lock()
		defer errorMutex.Unlock()
		switch {
		case strings.Contains(err.Error(), "read: connection reset by peer"):
			uniqueErrors["read: connection reset by peer"]++
		case strings.Contains(err.Error(), "write: broken pipe"):
			uniqueErrors["write: broken pipe"]++
		default:
			uniqueErrors[err.Error()]++
		}
		errorCount++
	}

	// Open the connections and exchange messages
	for i := 0; i < connCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, _, err := websocket.DefaultDialer.Dial(serverURL, nil)
			if err != nil {
				fmt.Println("WebSocket connection error:", err)
				recordError(err)
				return
			}
			defer conn.Close()
			statsMutex.Lock()
			totalConnCount++
			statsMutex.Unlock()

			// Send and receive messages for the specified duration
			startTime := time.Now()
			for time.Since(startTime) < duration {
				// Send message
				sendTime := time.Now()
				if err := conn.WriteMessage(websocket.TextMessage, []byte(payload)); err != nil {
					fmt.Println("Message send error:", err)
					recordError(err)
					return
				}
				// Receive response
				_, response, err := conn.ReadMessage()
				if err != nil {
					fmt.Println("Message receive error:", err)
					recordError(err)
					return
				}
				messageSize := len(response)
				sizeString := fmt.Sprintf("%d Byte", messageSize)
				if messageSize >= 1024 {
					sizeString = fmt.Sprintf("%.2f KB", float64(messageSize)/1024)
				}
				// Calculate and store latency
				endTime := time.Now()
				statsMutex.Lock()
				latencies = append(latencies, endTime.Sub(sendTime))
				statsMutex.Unlock()
				fmt.Printf("%d/%d %d:%d:%d Received response, size: %s\n",
					endTime.Month(), endTime.Day(), endTime.Hour(),
					endTime.Minute(), endTime.Second(), sizeString)
				// Wait (default 0.5 seconds) before sending the next message
				time.Sleep(time.Duration(iint) * time.Millisecond)
			}
		}()
	}

	// Wait for all goroutines to finish
	wg.Wait()

	// Calculate and display the average latency
	if len(latencies) > 0 {
		totalLatency := time.Duration(0)
		for _, latency := range latencies {
			totalLatency += latency
		}
		fmt.Println("Average latency:", totalLatency/time.Duration(len(latencies)))
	}
	fmt.Println("Total sent messages:", len(latencies))
	fmt.Println("Total connections:", totalConnCount)

	// Print unique error messages and counts
	fmt.Println("Total errors:", errorCount)
	for errMsg, count := range uniqueErrors {
		fmt.Println("Error:", errMsg, "(", count, ")")
	}
}
```
- Check memory using the ctop tool:
  - Before sending requests to Envoy: the memory is 16M initially.
  - After sending requests to Envoy: the memory is 39M. I waited at least 30 minutes and the memory is still 39M.
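ctop only reports the container's total memory, which cannot distinguish a leak from memory the allocator keeps in reserve. Envoy's admin interface also exposes a `/memory` endpoint that reports heap statistics as JSON (in builds that use tcmalloc). The sketch below queries it, assuming the admin listener from the config above at `localhost:9901`; the JSON field names (`allocated`, `heap_size`, with values encoded as strings) are my reading of the admin docs, so treat them as assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strconv"
)

// MemoryStats mirrors (part of) the JSON returned by Envoy's /memory admin
// endpoint. Field names are assumptions from the admin docs; the values
// arrive as decimal strings.
type MemoryStats struct {
	Allocated string `json:"allocated"`
	HeapSize  string `json:"heap_size"`
}

// parseMemoryStats decodes a /memory JSON body into byte counts.
func parseMemoryStats(body []byte) (allocated, heapSize uint64, err error) {
	var m MemoryStats
	if err = json.Unmarshal(body, &m); err != nil {
		return 0, 0, err
	}
	if allocated, err = strconv.ParseUint(m.Allocated, 10, 64); err != nil {
		return 0, 0, err
	}
	if heapSize, err = strconv.ParseUint(m.HeapSize, 10, 64); err != nil {
		return 0, 0, err
	}
	return allocated, heapSize, nil
}

func main() {
	// Admin listener address from the config above.
	resp, err := http.Get("http://localhost:9901/memory")
	if err != nil {
		fmt.Println("admin request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	allocated, heapSize, err := parseMemoryStats(body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	// If heap_size stays high while allocated drops back down after the
	// connections close, the "retained" memory is likely pages tcmalloc is
	// holding for reuse rather than live Envoy objects.
	fmt.Printf("allocated: %d bytes, heap_size: %d bytes\n", allocated, heapSize)
}
```

Comparing `allocated` against `heap_size` before and after a load run would show whether the 39M is live objects or allocator-retained pages.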
My question
- Is this a bug in Envoy itself? (I used Envoy's latest image to test.)
- If not, what did I miss?
cc @alyssawilk
I'd take a look at your stats and see if the connection is considered closed by Envoy or not.
@alyssawilk

`downstream_cx_upgrades_active` metric (meaning: total active upgraded connections)

- Before running the client code above:

```shell
# curl localhost:9901/stats | grep downstream_cx_upgrades_active
http.admin.downstream_cx_upgrades_active: 0
http.ingress_ws_to_ws.downstream_cx_upgrades_active: 0
```

- Checking the `downstream_cx_upgrades_active` metric while the client code is running:

```shell
# curl localhost:9901/stats | grep downstream_cx_upgrades_active
http.admin.downstream_cx_upgrades_active: 0
http.ingress_ws_to_ws.downstream_cx_upgrades_active: 500
```

`http.ingress_ws_to_ws.downstream_cx_upgrades_active` is 500.

- Checking the `downstream_cx_upgrades_active` metric right after the client code finished (checked almost immediately):

```shell
# curl localhost:9901/stats | grep downstream_cx_upgrades_active
http.admin.downstream_cx_upgrades_active: 0
http.ingress_ws_to_ws.downstream_cx_upgrades_active: 0
```

We can see the websocket connections are released almost immediately.
`downstream_cx_destroy` metric (meaning: total connections destroyed)

- Check `downstream_cx_destroy` each time after the client code above finishes:

```shell
# curl localhost:9901/stats | grep downstream_cx_destroy
http.admin.downstream_cx_destroy: 36
http.admin.downstream_cx_destroy_active_rq: 0
http.admin.downstream_cx_destroy_local: 0
http.admin.downstream_cx_destroy_local_active_rq: 0
http.admin.downstream_cx_destroy_remote: 36
http.admin.downstream_cx_destroy_remote_active_rq: 0
http.ingress_ws_to_ws.downstream_cx_destroy: 2000
http.ingress_ws_to_ws.downstream_cx_destroy_active_rq: 2000
http.ingress_ws_to_ws.downstream_cx_destroy_local: 0
http.ingress_ws_to_ws.downstream_cx_destroy_local_active_rq: 0
http.ingress_ws_to_ws.downstream_cx_destroy_remote: 2000
http.ingress_ws_to_ws.downstream_cx_destroy_remote_active_rq: 2000
listener.0.0.0.0_10000.downstream_cx_destroy: 2000
listener.admin.downstream_cx_destroy: 36
```

The `http.ingress_ws_to_ws.downstream_cx_destroy` value increases by 500 each time, e.g.:
- 500 after the first run finished
- 1000 after the second run finished
- 1500 after the third run finished
- 2000 after the fourth run finished
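The repeated `curl | grep` checks above can be automated. Envoy's `/stats` admin endpoint returns plain text with one `name: value` pair per line, so a small poller can extract just the counters of interest. This is a sketch using the stat names and admin port from the output above; the helper itself is mine:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
)

// statValue scans Envoy's plain-text /stats output ("name: value" per line)
// and returns the numeric value of the named counter or gauge.
func statValue(body, name string) (int64, bool) {
	prefix := name + ": "
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, prefix) {
			v, err := strconv.ParseInt(strings.TrimPrefix(line, prefix), 10, 64)
			return v, err == nil
		}
	}
	return 0, false
}

func main() {
	// Admin listener address from the config above.
	resp, err := http.Get("http://localhost:9901/stats")
	if err != nil {
		fmt.Println("stats request failed:", err)
		return
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	// The two stats tracked in this issue: active upgraded connections and
	// total destroyed connections.
	for _, name := range []string{
		"http.ingress_ws_to_ws.downstream_cx_upgrades_active",
		"http.ingress_ws_to_ws.downstream_cx_destroy",
	} {
		if v, ok := statValue(string(raw), name); ok {
			fmt.Printf("%s = %d\n", name, v)
		}
	}
}
```

Running this before, during, and after each load run reproduces the before/during/after tables above without manual grepping.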
conclusion
It seems the websocket connections are closed after each client-code run finishes.
The good news is that, after running the client code above several times, I found:
- The memory remains at a stable value (e.g. 42M~43M) from the second run onward.
  - So I guess the increased memory (42M - 12M = 30M) is the initial resource allocation for websocket support, but I'm not sure.
- The websocket connections are released immediately after each run finishes.
  - `http.ingress_ws_to_ws.downstream_cx_upgrades_active` is 0 after each run.

If there were a tool to identify the C++ objects behind the increased memory (42M - 12M = 30M), that would be better.
Let's make a simple summary now:
- The memory-increase phenomenon does exist.
- After the second (and later) tests, memory remains at a relatively stable value.
- The increased part of the memory is never released, even though all websocket connections are closed.
  - I think this increased memory is the initial resource for websocket support (just a guess).
  - If we could find the C++ objects making up the increased memory, that would be best. Is there any technique for analyzing this in depth?
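On the "analyze in depth" question: Envoy builds that include a heap profiler (gperftools/tcmalloc builds) expose a `/heapprofiler` admin endpoint that can be toggled via POST; the dumped profiles can then be attributed to C++ allocation sites with pprof. This is a hedged sketch, not a guaranteed recipe: whether the endpoint works depends on how the image was compiled, and the `enable=y` / `enable=n` query parameter is my reading of the admin docs:

```go
package main

import (
	"fmt"
	"net/http"
)

// heapProfilerURL builds the admin URL that toggles Envoy's heap profiler.
// The enable=y / enable=n query parameter is an assumption from the admin
// docs; verify it against your Envoy version.
func heapProfilerURL(adminBase string, enable bool) string {
	flag := "n"
	if enable {
		flag = "y"
	}
	return adminBase + "/heapprofiler?enable=" + flag
}

// toggleHeapProfiler issues the POST that mutating admin endpoints require.
func toggleHeapProfiler(adminBase string, enable bool) error {
	resp, err := http.Post(heapProfilerURL(adminBase, enable), "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("admin returned %s", resp.Status)
	}
	return nil
}

func main() {
	const admin = "http://localhost:9901" // admin listener from the config above
	// Start profiling, run the websocket load, then disable profiling and
	// inspect the dumped profile with pprof to see which C++ allocations
	// account for the retained memory.
	if err := toggleHeapProfiler(admin, true); err != nil {
		fmt.Println("enable failed (profiler may not be compiled in):", err)
		return
	}
	fmt.Println("heap profiler enabled; run the client code, then disable it")
	if err := toggleHeapProfiler(admin, false); err != nil {
		fmt.Println("disable failed:", err)
	}
}
```

If the profiler is not compiled into the official image, the `/memory` admin endpoint still gives a coarse breakdown of allocated versus allocator-reserved heap, which may already answer whether the 30M is live objects.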
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.