opamp-go
opamp-go copied to clipboard
opamp.Connection needs a Mutex to avoid concurrent write panic
In a busy server, we encountered a concurrent write panic. I added protection in the server, but ideally the library would add a Mutex to prevent this from happening.
Here is the relevant stack fragment:
panic: concurrent write to websocket connection
goroutine 115 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001dc95a8, 0x1, {0xc001d88160?, 0x2?, 0xc001dc95d8?})
.../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:617 +0x52b
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc001da8f20, 0xc0003d2930?, {0xc001d88160, 0x142, 0x142})
.../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:770 +0x139
github.com/open-telemetry/opamp-go/server.wsConnection.Send({0xc001dc9750?}, {0xc001d950e0?, 0xc001dba050?}, 0x270c?)
.../go/pkg/mod/github.com/open-telemetry/[email protected]/server/wsconnection.go:30 +0x55
...
This is working as intended per the Connection interface:
https://github.com/open-telemetry/opamp-go/blob/3f2eab449870fb6ed667fc55a205a68bf97a0935/server/types/connection.go#L16-L21
I will do some benchmarking with a large number of connections to determine the overhead of adding a mutex per connection.
FYI, we protect the connection with a mutex in our example. So it seems reasonable to just move this mutex inside the connection struct and have proper encapsulation.
One argument against having the mutex in the connection is that in the server implementation one may have a mutex per agent anyway (e.g. to protect agent's data fields) and that same mutex can be used to protect the connection. However, we don't do it in our example (we have a separate mutex for data fields) so perhaps the combined mutex is not such a good option (otherwise why don't we use it?).
I missed this issue while working on #200. In my opinion having separate mutexes for the WS connection and the agent data fields makes sense because the cost of the mutexes is low and it would cause contention between updating the agent and sending it messages if we had a shared mutex.
@andykellr @tigrannajaryan Are we satisfied that this is resolved?
Yes, I think we can close this.