corruption in the stream code
The original idea: The idea is based on the fact that the stream code writes somehow lp size using lpSetTotalBytes into already freed block. Before lpSetTotalBytes corrupts the memory, it should have low 32 bits of the "next" field and based on the examples we have it should be pretty large number or 0 (in case block->next is null).
in any case we read first with lpGetTotalBytes and if it is say 0 - it's a bug (we crash) because lp can not have 0 bytes (it has at least 7 bytes meta data) and if it's higher 16MB (highly unlikely) we call zmalloc_usable_size and check that the usable size is equal or higher than lpGetTotalBytes (consistency). and if not we abort (crash)
The PR: https://github.com/dragonflydb/dragonfly/pull/5252
The original crash log:
@ 0xaaaaab3b2f18 192 boost::context::detail::fiber_entry<>() |
-- | --
| | @ 0xaaaaab3b291c 576 util::ListenerInterface::RunSingleConnection() |
| | @ 0xaaaaaafd9a64 368 facade::Connection::HandleRequests() |
| | @ 0xaaaaaafd8b18 272 facade::Connection::ConnectionFlow() |
| | @ 0xaaaaaafd85ac 1680 facade::Connection::IoLoop() |
| | @ 0xaaaaaafd76bc 304 facade::Connection::ParseRedis() |
| | @ 0xaaaaaafd716c 464 facade::Connection::DispatchSingle() |
| | @ 0xaaaaaac18914 672 dfly::Service::DispatchCommand() |
| | @ 0xaaaaaac12aa4 208 dfly::Service::InvokeCmd() |
| | @ 0xaaaaaaeea4d4 672 dfly::CommandId::Invoke() |
| | @ 0xaaaaaace88d8 32 dfly::StreamFamily::XAdd() |
| | @ 0xaaaaaaf09fe0 64 dfly::Transaction::ScheduleSingleHop() |
| | @ 0xaaaaaaf09f48 416 dfly::Transaction::Execute() |
| | @ 0xaaaaaaf09684 592 dfly::Transaction::ScheduleInternal() |
| | @ 0xaaaaaaf082f0 288 dfly::Transaction::ScheduleInShard() |
| | @ 0xaaaaaaf07b08 208 dfly::Transaction::RunCallback() |
| | @ 0xaaaaaacefebc 1344 absl::lts_20240722::functional_internal::InvokeObject<>() |
| | @ 0xaaaaaacef0b8 32 dfly::(anonymous namespace)::OpAdd() |
| | @ 0xaaaaab00866c 32 lpNew |
| | @ 0xfffff7ffb8f8 4960 (unknown) |
| | @ 0xaaaaab5ec214 480 absl::lts_20240722::AbslFailureSignalHandler() |
| | PC: @ 0xaaaaab01f298 (unknown) _mi_page_malloc_zero |
| | *** SIGSEGV received at time=1749307602 on cpu 0 ***
I can be wrong, but it seems to me that https://github.com/dragonflydb/dragonfly/issues/5202 may be relevant to this one.
I see that you merged this PR: #5252. Is this bug fixed? What is the status of this issue?
@BagritsevichStepan This is my opinion based on testing. I couldn't reproduce this particular issue with fuzzing and script testing. But https://github.com/dragonflydb/dragonfly/issues/5202 causes so many times for the stream family. This is my assumption that they can be relevant.
#5202 is issue about memory usage tracking. Memory usage tracking should not touch other logic. We never use the stream size in stream_family.cc. And we do not free memory based on memory usage values. I don't see how this bug can be related. It should be another issue