dice icon indicating copy to clipboard operation
dice copied to clipboard

[DRAFT] Fix concurrency issue #1591

Open mintobit opened this issue 8 months ago • 8 comments

Fixes https://github.com/DiceDB/dice/issues/1591

The source of the issue is absence of synchronization when Client is writing to and reading from TCP connection.

Need your feedback, to make sure you are ok with the approach I took:

  • Use fixed 4 byte header for each message (wire.Command/wire.Response)
  • Add correlation_id to each command/response to match them
  • On Client side, read responses from net.Conn in a loop
  • Put responses in corresponding channel (map[corr_id] = chan *Response)
  • Each Client.Fire() reads from a corresponding channel

Here are the links to all of the PRs (this fix requires changes to multiple repositories): dicedb-go PR dice PR dicedb-protos PR

  • [x] Communicate with maintainers on the overall approach
  • [ ] Make code production-ready (error handling etc)

Before this change: before After this change: after

mintobit avatar Mar 23 '25 11:03 mintobit

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Mar 23 '25 11:03 CLAassistant

Thanks a ton for the patch. This has been long running in my head. Reviewing it tomorrow.

arpitbbhayani avatar Mar 25 '25 18:03 arpitbbhayani

Thanks a ton for the patch. This has been long running in my head. Reviewing it tomorrow.

Glad to help.

It's far from being ready to be merged. Just wanted to validate the approach early on. Let me know what you think.

If the overall idea is good, I could work on improvements. One thing on my mind is also how to introduce these changes in a backwards-compatible way

mintobit avatar Mar 25 '25 19:03 mintobit

@mintobit I went through all the PRs. A couple of comments

  1. The header makes complete sense, and makes things predictable. So completely aligned with the approach.
  2. I did not understand the use of CorrId. Given that we are spinning up a new IOThread to handle each connection/client, does this CorrId change how does it enable us to support multiple commands? Post this change will we be able to support multiple commands from diff clients - I think we are still handling them. Or will it handle multiple commands from the same client over the same TCP connection? - If yes, then this takes us towards supporting TCP pipelining.

Wanted to understand your thgouhts here.

arpitbbhayani avatar Mar 29 '25 17:03 arpitbbhayani

@arpitbbhayani thanks for checking this out.

IOThread per connection works well when somebody executes commands concurrently over multiple dicedb clients (because this creates two separate TCP connections).

#1591 is about concurrent execution over the same client (hence over the same TCP connection). Currently we just allow concurrent writes/reads over the same connection, so the outcome depends on timing.

Two options to fix it that I can see:

  1. Serialize Client.Fire() calls using a mutex. This would mean that a concurrent client.Fire() will have to wait until server responds to the first client.Fire(). Easy fix, adds thread safety, but limits concurrency (kinda like SERIALIZABLE isolation level in databases)
  2. Allow multiple commands to be written over the same TCP connection before any response is produced by server. When reading responses over TCP, use correlation id to figure out for which command this response is. More complex fix, adds thread safety, allows better concurrency

Let's say we have two commands executed concurrently:

  • COMMAND1 (30ms server latency)
  • COMMAND2 (1ms server latency) Option 1 means that COMMAND2 would have to wait 30ms before it could be sent to the server Option 2 means that COMMAND2 can be sent to server without waiting 30ms (but because the server still executes them sequentially, the response to COMMAND2 would arrive after 30ms + 1ms)

This PR implements option 2

A potential improvement would we to allow the server to process multiple commands from the same TCP connection concurrently . This is not implemented here, but makes it possible because of correlation ids and ability two write multiple commands/responses to the same connection

mintobit avatar Mar 29 '25 19:03 mintobit

@mintobit Aligned. We can go ahead with the change. Thanks a ton for the patch.

Let me know whenever you are ready for the merge. Will give one final review before we merge. This is fab!

arpitbbhayani avatar Apr 01 '25 16:04 arpitbbhayani

@arpitbbhayani this is essentially a change in protocol between the client and server, is it mandatory for us to keep older dicedb-go versions with these changes?

I thought about using some kind of protocol version in the HANDSHAKE. But this would only help for future changes, since older dicedb-go won't have that. Plus the HANDSHAKE is just a command, which means it is also affected by the protocol change (fixed prefix for message)

I guess I might figure out some way to communicate protocol version between client and server. But wan't to make sure it is worth the time

mintobit avatar Apr 01 '25 20:04 mintobit

FYI, working on making this production ready.

I want to separate the protocol part into separate repo, so that both client and server use the same exact implementation of reads/writes. This would reduce the burden in terms of testing server/client compatibility. Will ask to transfer the repo once I'm done

mintobit avatar Apr 05 '25 20:04 mintobit

Closing to reopen MR with larger scope

mintobit avatar Apr 15 '25 21:04 mintobit