ssdb-rocks
Multithread scalability issue -- BinlogQueue lock contention
I am running multiple reader/writer worker threads and seeing 15 ms latency spikes. Based on various test results, I think the cause of the high tail latency is contention for the lock associated with the BinlogQueue object (acquired when a Transaction object is created in functions like SSDB::set()). An SSDB instance has a single BinlogQueue object, so the lock acts as a global lock. If a worker thread is de-scheduled while it holds the lock, no other thread can make progress until that thread is re-scheduled, finishes its transaction, and releases the lock.
I am wondering about the purpose of BinlogQueue (which uses RocksDB's WriteBatch object). Each "batch" contains only a single Put() or Delete() operation, followed by a Put() associated with the BinlogQueue log (why?), and then db->Write() is called. My questions are:
- What is the purpose of the BinlogQueue log? I see there is an option to set no_log = true for the BinlogQueue. When is this recommended?
- Given that a transaction seems to only contain a single operation, is a lock operation in the Transaction constructor necessary? If yes, why?
Thanks.
Hi,
- The purpose of the binlog is replication: the server logs every write to it and sends the log to a slave, which replays the write operations. If you don't need master-slave replication, you can set
  binlog: no
  in ssdb.conf.
- The lock is necessary. A set operation is actually two writes, the key-value pair and the binlog item, and the two must be performed in one WriteBatch.
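To make the second point concrete, here is a minimal, self-contained sketch of why a data write and its binlog entry might be kept under one lock. It is not ssdb or RocksDB code: MiniStore, set(), get(), and binlog_size() are hypothetical names, std::map stands in for the database, and the sketch assumes the binlog uses monotonically increasing sequence numbers. The lock keeps the sequence number, the key-value write, and the binlog write together, so a replica replaying the log sees writes in the order they were applied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a store that logs every write for replication.
// None of these names come from ssdb or RocksDB.
class MiniStore {
public:
    // Applies the key-value write and its binlog entry as one unit and
    // returns the sequence number assigned to the write.
    std::uint64_t set(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> guard(mutex_);  // one lock for the whole store
        const std::uint64_t seq = last_seq_ + 1;
        data_[key] = value;            // write 1: the key-value pair
        binlog_[seq] = "set " + key;   // write 2: the binlog item
        last_seq_ = seq;
        assert(binlog_.size() == last_seq_);  // log has no gaps or duplicates
        return seq;
    }

    // Returns the stored value, or an empty string if the key is absent.
    std::string get(const std::string& key) const {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = data_.find(key);
        return it == data_.end() ? std::string() : it->second;
    }

    std::size_t binlog_size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return binlog_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, std::string> data_;
    std::map<std::uint64_t, std::string> binlog_;
    std::uint64_t last_seq_ = 0;
};
```

Without the lock, two threads could read the same last_seq_ and overwrite each other's binlog entry. With binlog disabled there is only one write per operation, so this particular reason for the lock goes away.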
Thanks. Since I do not need replication for my use case, I have eliminated the use of WriteBatch and made each Put and Get operation go directly to the RocksDB database, i.e., db->Put() and db->Get(), instead of adding operations to a lock-protected WriteBatch and then calling db->Write().
However, I still see a 15 ms gap between average and 95th percentile latency. Have you observed this as well? The gap persists even when I use only 1 reader and 1 writer and pin the threads to separate cores to avoid interference from the Linux scheduler. So I wonder if there is some rare event (occurring ~5% of the time) that causes ssdb to sleep or delay execution of a request for ~15 ms. I am currently going through the code in detail, but if you have any ideas for explaining these latency spikes, that would be very helpful. Thank you!
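For anyone trying to reproduce the numbers, this is roughly how an average vs. p95 comparison can be computed from per-request timings. It is only a sketch: time_ms(), mean(), and percentile() are hypothetical helpers, not part of ssdb, and the percentile uses the nearest-rank definition, which may differ slightly from what a given benchmarking tool reports.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Runs op once and returns its wall-clock duration in milliseconds.
// steady_clock is used so clock adjustments cannot skew the sample.
template <typename Op>
double time_ms(Op&& op) {
    const auto start = std::chrono::steady_clock::now();
    op();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Arithmetic mean of the samples.
double mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
// Takes the vector by value because it sorts its own copy.
double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    const std::size_t rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
    return samples[rank - 1];
}
```

As an illustration with made-up numbers: 94 requests at 1 ms plus 6 requests at 16 ms give a mean of 1.9 ms but a p95 of 16 ms, so a slow path hit by only a few percent of requests is enough to produce a gap of this size. Recording a timestamp alongside each sample also shows whether the slow requests are periodic, which would point to a background event rather than random interference.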