pystatsd
pystatsd copied to clipboard
Race condition between __record_timer() and flush()
Occasionally our pystatsd instance freezes, maybe in a similar way as in #40. Strace showed a KeyError
in the following line:
See server.py:118
if key not in self.timers:
self.timers[key] = [ [], ts ]
self.timers[key][0].append(float(value or 0))
self.timers[key][1] = ts # <--- KeyError
It seems that the check ensures that the key is always present in self.timers
. However, there's a second thread running, flushing the contents of self.timers
periodically, and also deleting the contents of self.timers
.
See server.py:228
del(self.timers[k])
When the flush happens right between the check in line 115 and the access in lines 117 and 118 then a KeyError
will be raised. We think this is exactly what happens.
We haven't seen it happen in self.__record_counter
, but it looks vulnerable in the same way.
We haven't had time yet to come up with a patch, but maybe something like this would work:
Use a threading.Lock()
to synchronize access to self.timers
and self.counters
. The self.flush()
involves network communication and can therefore take a long time. We don't want to block the receiving code during that time. Therefore maybe just use the lock to (deep-)copy self.timers
and reset it. Then release the lock and flush from the copy. (Or not copy at all, but just switch references, i.e. assign self.timers to a second variable pointing to the same dictionary and then assign self.timers an empty dict and continue flushing from the new variable.)
I just checked the master code and it doesn't involve the key check anymore (we're using 0.1.10). So maybe that bug is fixed. What's the ETA for the next release?
I looked at the changeset that fixes the problem.
It now looks like this:
timer = self.timers.setdefault(key, [ [], ts ])
timer[0].append(float(value or 0))
timer[1] = ts
When during flush the key is now deleted directly after setdefault
returns, then the new value will be stored only into the local timer value. And, I believe, it will simply be lost and never flushed.
I think that's OK, depending on the guarantees that pystatsd wants to make. If messages should not be dropped, then a different solution must be chosen.
@vvuksan Can you take a look at this when you get some time?