yugabyte-db
[DocDB] Failed to write a batch with 0 operations into RocksDB: Corruption (yb/dockv/doc_kv_util.cc:135): Error when decoding hashed components of a document key
Jira Link: DB-11251
Description
During one of the CDC runs for PG Parity testing, we observed a fatal corruption error on 2024.1.0.0-b123:
F20240509 20:44:08 ../../src/yb/tablet/tablet.cc:1517] T 6b08eefae2b84fa189350a8c956a904f P d132e7cd9cf549f3b6f9df427e383c04: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/dockv/doc_kv_util.cc:135): Error when decoding hashed components of a document key: while consuming primitive values from 5366326436323032612D363962342D343362372D386532632D6462643530633632663538353A31353838393A346532696563655A3F696C30673F6766306A3F6B3365313F323032686331696534686B683566336A66646630616A6D3168616867316B616C346632613331336964326A6635696C693667346B67656731626B6E3269626968326C626D356733623432346A65336B67366A6D6A3621324F4D513632322380013C328BE6C5BE804A: Encoded string is not terminated with \0x00\0x00
@ 0x56030e766427 google::LogMessage::SendToLog()
@ 0x56030e76735d google::LogMessage::Flush()
@ 0x56030e7679a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x56030fbbb933 yb::tablet::Tablet::WriteToRocksDB()
@ 0x56030fbb7905 yb::tablet::Tablet::ApplyIntents()
@ 0x56030fbb84d2 yb::tablet::Tablet::ApplyIntents()
@ 0x56030fc70b71 yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
@ 0x56030fb9204c yb::tablet::UpdateTxnOperation::DoReplicated()
@ 0x56030fb8579e yb::tablet::Operation::Replicated()
@ 0x56030fb87b4f yb::tablet::OperationDriver::ReplicationFinished()
@ 0x56030ec3ca2b yb::consensus::ConsensusRound::NotifyReplicationFinished()
@ 0x56030ec8b38f yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
@ 0x56030ec8a6f9 yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
@ 0x56030ec74004 yb::consensus::RaftConsensus::UpdateReplica()
@ 0x56030ec53f83 yb::consensus::RaftConsensus::Update()
@ 0x56030fed9f9e yb::tserver::ConsensusServiceImpl::UpdateConsensus()
@ 0x56030ece1fee std::__1::__function::__func<>::operator()()
@ 0x56030ece2c1f yb::consensus::ConsensusServiceIf::Handle()
@ 0x56030fadf649 yb::rpc::ServicePoolImpl::Handle()
@ 0x56030f9fc05f yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x56030faeee43 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x560310306c13 yb::Thread::SuperviseThread()
@ 0x7f73305551ca start_thread
@ 0x7f73307a6e73 __GI___clone
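To inspect the key bytes in the fatal message by hand, the hex dump can be decoded and rendered with non-printable bytes escaped. The following is only a sketch: it assumes the dump is a plain hex encoding of the raw key bytes, and the helper names `render_key_bytes` and `has_double_zero_terminator` are hypothetical, not part of YugabyteDB. The terminator check simply searches for two consecutive zero bytes, matching the wording of the error; it does not reimplement the DocDB decoder.

```python
import string

# Printable ASCII, excluding control-style whitespace (tab, newline, etc.).
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")


def render_key_bytes(hex_dump: str) -> str:
    """Decode a hex dump and escape every non-printable byte as \\xNN."""
    raw = bytes.fromhex(hex_dump)
    return "".join(chr(b) if b in PRINTABLE else f"\\x{b:02x}" for b in raw)


def has_double_zero_terminator(hex_dump: str) -> bool:
    """Return True if two consecutive zero bytes appear anywhere in the dump."""
    return b"\x00\x00" in bytes.fromhex(hex_dump)


if __name__ == "__main__":
    # First 14 bytes of the key from the fatal message above.
    prefix = "5366326436323032612D36396234"
    print(render_key_bytes(prefix))
    print(has_double_zero_terminator(prefix))
```

Running the same helpers on the full hex string from the log shows which part of the key is readable text and whether a `\x00\x00` sequence is present at all.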
Here is what the test case does:
Create 10 databases.
Create 1 table in each database; these are the tables whose CDC streaming we are interested in.
Iteratively load data into each table and validate that it is being streamed.
In parallel, nemeses are running. The server-side nemeses in this run were: stop/start nodes and restart the master process.
In parallel, I am also creating and dropping dummy tables at random.
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- [X] I confirm this issue does not contain any sensitive information.
The very first error [1] indicates that Raft replication ran into issues; the subsequent fatals in yb::tablet::Tablet::WriteToRocksDB() reported in #22344 are cascading failures. I checked the other nodes, and this appears to be hit only on N2. So it does look like packet corruption of some sort on node N2.
[1]
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0509 20:39:43.549770 33092 tablet_peer.cc:1378] Invalid value of enum consensus::OperationType (full enum type: yb::consensus::OperationType, expression: replicate_msg->op_type()): 0.
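When scanning the other nodes' logs for the first fatal, it can help to split each line according to the format quoted in [1]. The sketch below assumes that format exactly (`[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg`); the function name `parse_glog_line` is hypothetical. Lines in other layouts, such as the `F20240509 20:44:08 ...` line in the description, do not match and yield `None`.

```python
import re

# One named group per field of the documented log line format.
GLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_glog_line(line: str):
    """Return a dict of fields for a matching log line, else None."""
    match = GLOG_RE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "F0509 20:39:43.549770 33092 tablet_peer.cc:1378] "
        "Invalid value of enum consensus::OperationType"
    )
    record = parse_glog_line(sample)
    print(record["severity"], record["file"], record["line"])
```

Filtering on `severity == "F"` and sorting by the month, day, and time fields gives the earliest fatal per node.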
@shamanthchandra-yb , Can we check if this repros on clusters with TLS enabled?
@rthallamko3 Many versions of this test case are currently being run as part of CDC PG Parity testing. This was the only run in which this issue occurred, 2 weeks ago. There seems to be a very small chance of hitting it, even without TLS. Even if a TLS run passes, I don't think we will have sufficient data to confirm the theory. Please let me know if you think it would still be helpful to run with TLS.
@shamanthchandra-yb , Can we check if this repros on clusters with TLS enabled? I think you were planning to run it in that configuration.
@rthallamko3 I performed CDC runs with TLS enabled, and the issue did not reoccur. I believe we can close this for now, and I'll reopen it if it happens again. Thanks.