rocksdb
rocksdb copied to clipboard
Confusion about `MemTableInfo` semantics
Expected behavior
There're two seqno fields in MemTableInfo
:
- first_seqno:
https://github.com/facebook/rocksdb/blob/50e9b3f9c7e9198257d0398bf14c8ddad70527cb/include/rocksdb/listener.h#L481-L483
- earliest_seqno:
https://github.com/facebook/rocksdb/blob/50e9b3f9c7e9198257d0398bf14c8ddad70527cb/include/rocksdb/listener.h#L484-L488
Actual behavior
In implementation, first_seqno
is actually the smallest seqno, which is fine because it's difficult to reason about the ordering between concurrent inserts. But it's unclear whether users can rely on first_seqno
being the smallest because there's no official document about it.
https://github.com/facebook/rocksdb/blob/50e9b3f9c7e9198257d0398bf14c8ddad70527cb/db/memtable.cc#L834-L837
earliest_seqno
is actually smallest seqno - 1, this violates the document that "It can then be assumed that any write with a larger(or equal) sequence number will be present in this memtable or a later memtable."
https://github.com/facebook/rocksdb/blob/50e9b3f9c7e9198257d0398bf14c8ddad70527cb/db/db_impl/db_impl_write.cc#L2113-L2114
earliest_seqno
is actually smallest seqno - 1, this violates the document that "It can then be assumed that any write with a larger(or equal) sequence number will be present in this memtable or a later memtable."
Agreed, the API doc is wrong. I went through the same confusion in https://github.com/facebook/rocksdb/pull/11861#discussion_r1332069364. If you have ideas or time to fix it, please feel welcome to contribute :)
Seems like the fix is just as simple as
Any ritual to do before changing the public API? @ajkr
There is no ritual for fixing the API documentation. Please feel free to submit a PR like the diff in your screenshot; we can review it then.