[DocDB] Failed to write a batch with 0 operations into RocksDB: Corruption (yb/dockv/doc_kv_util.cc:135): Error when decoding hashed components of a document key

Open shamanthchandra-yb opened this issue 9 months ago • 1 comments

Jira Link: DB-11251

Description

During one of the CDC runs for PG Parity testing, we observed a corruption FATAL on 2024.1.0.0-b123:

F20240509 20:44:08 ../../src/yb/tablet/tablet.cc:1517] T 6b08eefae2b84fa189350a8c956a904f P d132e7cd9cf549f3b6f9df427e383c04: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/dockv/doc_kv_util.cc:135): Error when decoding hashed components of a document key: while consuming primitive values from 5366326436323032612D363962342D343362372D386532632D6462643530633632663538353A31353838393A346532696563655A3F696C30673F6766306A3F6B3365313F323032686331696534686B683566336A66646630616A6D3168616867316B616C346632613331336964326A6635696C693667346B67656731626B6E3269626968326C626D356733623432346A65336B67366A6D6A3621324F4D513632322380013C328BE6C5BE804A: Encoded string is not terminated with \0x00\0x00
    @     0x56030e766427  google::LogMessage::SendToLog()
    @     0x56030e76735d  google::LogMessage::Flush()
    @     0x56030e7679a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x56030fbbb933  yb::tablet::Tablet::WriteToRocksDB()
    @     0x56030fbb7905  yb::tablet::Tablet::ApplyIntents()
    @     0x56030fbb84d2  yb::tablet::Tablet::ApplyIntents()
    @     0x56030fc70b71  yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
    @     0x56030fb9204c  yb::tablet::UpdateTxnOperation::DoReplicated()
    @     0x56030fb8579e  yb::tablet::Operation::Replicated()
    @     0x56030fb87b4f  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x56030ec3ca2b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0x56030ec8b38f  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0x56030ec8a6f9  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x56030ec74004  yb::consensus::RaftConsensus::UpdateReplica()
    @     0x56030ec53f83  yb::consensus::RaftConsensus::Update()
    @     0x56030fed9f9e  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0x56030ece1fee  std::__1::__function::__func<>::operator()()
    @     0x56030ece2c1f  yb::consensus::ConsensusServiceIf::Handle()
    @     0x56030fadf649  yb::rpc::ServicePoolImpl::Handle()
    @     0x56030f9fc05f  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x56030faeee43  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x560310306c13  yb::Thread::SuperviseThread()
    @     0x7f73305551ca  start_thread
    @     0x7f73307a6e73  __GI___clone
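
The hex dump in the fatal message can be inspected offline to see what the decoder was looking at. Below is a minimal sketch, not YugabyteDB's decoder: `inspect_key_dump` is a hypothetical helper that only converts the hex to bytes, renders the printable characters, and looks for the two zero bytes that the error message says are missing. It makes no attempt to reimplement the DocDB key format.

```python
def inspect_key_dump(hex_dump: str) -> dict:
    """Summarize a hex-encoded key dump copied from a log line.

    Hypothetical helper for triage only; it does not parse DocDB keys.
    """
    raw = bytes.fromhex(hex_dump)
    # The error message complains about a missing \x00\x00 terminator,
    # so report whether (and where) that byte pair occurs at all.
    terminator = raw.find(b"\x00\x00")
    return {
        "length": len(raw),
        "has_terminator": terminator != -1,
        "terminator_offset": terminator if terminator != -1 else None,
        # Non-printable bytes are shown as '.' to keep the output readable.
        "printable": "".join(chr(b) if 32 <= b < 127 else "." for b in raw),
    }


# First 10 bytes of the dump above, used only to keep the example short.
prefix = "5366326436323032612D"
info = inspect_key_dump(prefix)
print(info["printable"])       # Sf2d6202a-
print(info["has_terminator"])  # False
```

Passing the full hex string from the log line instead of `prefix` gives the same kind of summary for the whole dumped slice.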

Here is what the test case does:

Create 10 databases.
Create 1 table in each database; these are the tables we are interested in streaming through CDC.
Iteratively load data into each table and validate that the changes are streaming.
In parallel, run nemeses. The server-side nemeses in this run were: stop/start nodes and restart the master process.
Also in parallel, create and drop dummy tables at random.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • [X] I confirm this issue does not contain any sensitive information.

shamanthchandra-yb, May 10 '24 07:05

The very first error [1] indicates that Raft replication ran into issues; the subsequent fatals in yb::tablet::Tablet::WriteToRocksDB() reported in #22344 are cascading failures. I checked the other nodes, and this appears to be hit only on N2, so it looks like packet corruption of some sort on node N2.

[1]

Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0509 20:39:43.549770 33092 tablet_peer.cc:1378] Invalid value of enum consensus::OperationType (full enum type: yb::consensus::OperationType, expression: replicate_msg->op_type()): 0.
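
When comparing the first fatal across nodes, it can help to split such lines into fields. The following is a rough sketch based only on the "Log line format" header quoted above; the regular expression and the `parse_glog_line` name are illustrative assumptions, not part of any YugabyteDB tooling.

```python
import re
from typing import Optional

# Mirrors "[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg".
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])"
    r"(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) +"
    r"(?P<thread_id>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)


def parse_glog_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = GLOG_LINE.match(line)
    return match.groupdict() if match else None


sample = (
    "F0509 20:39:43.549770 33092 tablet_peer.cc:1378] "
    "Invalid value of enum consensus::OperationType "
    "(full enum type: yb::consensus::OperationType, "
    "expression: replicate_msg->op_type()): 0."
)
parsed = parse_glog_line(sample)
print(parsed["severity"], parsed["time"], parsed["file"], parsed["line"])
# F 20:39:43.549770 tablet_peer.cc 1378
```

Lines that use a different prefix, such as the `F20240509 20:44:08 ...` form in the issue description, do not match this pattern and return `None`.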

rthallamko3, May 10 '24 21:05

@shamanthchandra-yb, can we check whether this repros on clusters with TLS enabled?

rthallamko3, May 21 '24 22:05

@rthallamko3 Many variants of this test case are currently being run as part of CDC PG Parity testing. This was a one-off run, 2 weeks ago, and the only one where the issue occurred. The chance of hitting it seems minuscule, even without TLS, so I don't think a passing run with TLS would give us enough data to confirm the theory. Please let me know if you think running with TLS would still be helpful.

shamanthchandra-yb, May 22 '24 04:05

@shamanthchandra-yb, can we check whether this repros on clusters with TLS enabled? I think you were planning to run it in that configuration.

rthallamko3, Jun 18 '24 20:06

@rthallamko3 I performed CDC runs with TLS enabled, and the issue did not reoccur. I believe we can close this for now, and I'll reopen it if it happens again. Thanks.

shamanthchandra-yb, Jun 23 '24 11:06