Connection cross-talk in high load situations

Open jalava opened this issue 10 years ago • 7 comments

Hi,

We noticed node-memcached returning wrong documents when Node was under very high load, at hundreds of requests per second.

In these situations, a get for key A returned the document for key B. This led us to believe there might be a bug in the connection pooling, as it only happens under high load and in bursts of a few seconds. Upgrading node-memcached from 2.0.1 to 2.1.0 made the situation worse: we saw the issue about once an hour, with 10-20 cross-talks each time, all within 5-10 seconds.

We were running multiple workers per server across multiple servers, with a bit over 6000 requests per second in total, which comes down to about 10 reqs per worker. We had a poolSize of 10, which made me think that when poolSize is reached, there may be an issue where a socket is released too early for the next connection and the buffer returns bad data.

jalava avatar Feb 10 '15 07:02 jalava

Looked at the jackpot connection code and these might cause it, in https://github.com/3rd-Eden/jackpot/blob/master/index.js:

  // o, dear, we got issues.. we didn't find a valid connection and we cannot
  // create more.. so we are going to check if we might have semi valid
  // connection by sorting the probabilities array and see if it has
  // a probability above 60
  probability = probabilities.sort(function sort(a, b) {
    return a.probability - b.probability;
  }).pop();

  if (probability && probability.probability >= 60) {
    fn(undefined, probability.connection);
    return this;
  }

and

  // We didn't find any reliable states of the stream, so we are going to
  // assume something random, because we have no clue, so generate a random
  // number between 0 - 70.
  return Math.floor(Math.random() * 70);

This means that a stream in an unknown state might be returned 1 out of 7 times: its random probability ranges from 0 to 69, and we allow streams with a probability of 60 or above to be reused.
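
As a quick sanity check of the 1-in-7 estimate, the sketch below enumerates every value the fallback expression `Math.floor(Math.random() * 70)` can produce and counts how many pass the `>= 60` reuse check. The constant names `FALLBACK_RANGE` and `REUSE_THRESHOLD` are made up for illustration and are not jackpot identifiers.

```javascript
// Hypothetical constants mirroring the two jackpot snippets quoted above.
const FALLBACK_RANGE = 70;  // Math.floor(Math.random() * 70) yields 0..69
const REUSE_THRESHOLD = 60; // a probability >= 60 is accepted for reuse

// Count the integer outcomes that would be accepted for reuse.
let accepted = 0;
for (let p = 0; p < FALLBACK_RANGE; p++) {
  if (p >= REUSE_THRESHOLD) accepted++;
}

const chance = accepted / FALLBACK_RANGE;
console.log(`${accepted} of ${FALLBACK_RANGE} outcomes pass the threshold`);
console.log(`chance of reuse: ${chance.toFixed(4)}`);
```

Ten of the seventy equally likely outcomes (60 through 69) pass, which matches the 1 out of 7 figure.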

jalava avatar Feb 10 '15 09:02 jalava

This sounds bad. I don't like the idea that a connection pooling library randomly returns connections if it doesn't actually know if they are free or not. That's just a disaster waiting to happen.

garo avatar Feb 10 '15 09:02 garo

@garo it's not as bad as it sounds, as it only checks whether they are ready to write or not. If they are not ready, the data would be queued and take a bit longer to go out; we're talking milliseconds here.

3rd-Eden avatar Feb 10 '15 11:02 3rd-Eden

@3rd-Eden What about when you have pending writes on a connection and you give that connection from the pool to a new client? How does the new client know which client the response is meant for?

Use case:

  • We are running out of connections in pool
  • Only isAvailable is otherwise available but it has writes = 1, giving it probability of 99
  • This connection is pulled from the manager and a new "get" is written to it.
  • If the previous write was also a "get", both clients receive the same response because they are listening to the same buffer. If clients drain the data from the buffer, it might work, provided listeners are called in the correct order.

This is more of a jackpot issue, I think, rather than a node-memcached issue, and it needs some testing to see whether I can make an easily reproducible test case for it.
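
To make the suspected failure mode concrete, here is a minimal sketch of the use case above. It is not node-memcached or jackpot code: `sharedConnection` is a plain `EventEmitter` standing in for a pooled socket, and the `get` helper is hypothetical. It only illustrates what happens if two requests listen on the same connection and each takes the next chunk that arrives.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a pooled socket: it only emits 'data' events.
const sharedConnection = new EventEmitter();

// Hypothetical client: registers for the next chunk on the connection and
// records which key it asked for next to what it actually received.
function get(connection, key, results) {
  connection.once('data', (chunk) => {
    results.push({ requested: key, received: chunk });
  });
}

const results = [];
get(sharedConnection, 'A', results); // first client, response still pending
get(sharedConnection, 'B', results); // pool hands out the same connection again

// The server answers the first request; both listeners fire on the same chunk.
sharedConnection.emit('data', 'VALUE A 0 1\r\n1\r\nEND\r\n');

console.log(results);
```

In this toy model the client that asked for B ends up holding the reply for A, which is the kind of cross-talk described at the top of the issue. Whether the real libraries behave this way still needs a proper test case.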

jalava avatar Feb 10 '15 12:02 jalava

This might also be memcached server related. We noticed that we're running 1.4.14, and since then a few interesting bugs have been fixed which might be related to this. We'll upgrade our memcached instances to see if this helps. :)

garo avatar Feb 11 '15 07:02 garo

I did notice that node-memcached does not check that the key token in VALUE matches the key in metadata, which would have caught these issues immediately.

I would do a pull request, but I'm not sure how it should handle the situation where VALUE is mismatched: close all connections, or search the metadata for the potentially correct answer?

It's quite strange though, as there are only two ways this can happen. Either the server doesn't answer one command at all but does answer the next one on the same connection (commands and answers get offset by one, so A receives the answer for B, B receives the answer for C, etc.), or the server fails to send END for the first command but does send it for the next ones, and then the first command receives both values. TCP should handle most cases, so a bug in the memcached server itself seems like the only sensible answer.

The first case can be detected by checking the key in VALUE, the second by adding simple state machine logic that doesn't allow two commands without an END in between.
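
A rough sketch of both checks, assuming a complete reply to a single-key `get` in the memcached text protocol (`VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n`, or just `END\r\n` on a miss). The function `checkGetReply` is hypothetical and not part of node-memcached. It also assumes the reply is held as a string with ASCII-only data, so that string length equals byte length. It only reports a problem and leaves open what to do about it.

```javascript
// Hypothetical validator for one complete reply to "get <requestedKey>".
function checkGetReply(requestedKey, reply) {
  let pos = 0;
  let sawValue = false;

  while (pos < reply.length) {
    const eol = reply.indexOf('\r\n', pos);
    if (eol === -1) return { ok: false, reason: 'incomplete reply' };

    const header = reply.slice(pos, eol);
    pos = eol + 2;

    if (header === 'END') {
      // END must close the reply; anything after it belongs to another command.
      return pos === reply.length
        ? { ok: true, hit: sawValue }
        : { ok: false, reason: 'data after END' };
    }

    const [token, key, , bytes] = header.split(' ');
    if (token !== 'VALUE') {
      return { ok: false, reason: 'unexpected line: ' + header };
    }
    // State check: a single-key get may carry at most one VALUE before END.
    if (sawValue) {
      return { ok: false, reason: 'second VALUE without END in between' };
    }
    // Key check: the VALUE line must name the key that was requested.
    if (key !== requestedKey) {
      return { ok: false, reason: `key mismatch: asked ${requestedKey}, got ${key}` };
    }
    sawValue = true;
    pos += Number(bytes) + 2; // skip the data block and its trailing \r\n
  }
  return { ok: false, reason: 'missing END' };
}

console.log(checkGetReply('A', 'VALUE A 0 1\r\n1\r\nEND\r\n'));
console.log(checkGetReply('A', 'VALUE B 0 1\r\n2\r\nEND\r\n'));
console.log(checkGetReply('A', 'VALUE A 0 1\r\n1\r\nVALUE B 0 1\r\n2\r\nEND\r\n'));
```

The second call corresponds to the offset-by-one case (a reply for B arriving for a get of A), and the third to the missing-END case where one command receives two values.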

jalava avatar Feb 11 '15 07:02 jalava

Any solutions for this? Experiencing this frequently under load.

kcarlson avatar Oct 13 '16 08:10 kcarlson