yugabyte-db
yugabyte-db copied to clipboard
[DocDB] Failure seen on 2.1.10.0 trying to get to 300 GB SST size per node with Cassandra workloads
Jira Link: DB-3130
Description
Reported by @def-.
Failure seen on 2.1.10.0 trying to get to 300 GB SST size per node with Cassandra workloads:
[root@centosvm centos]# cat /home/yugabyte/tserver/logs/yb-tserver.FATAL.details.2022-08-07T10_53_24.pid25718.txt
F20220807 10:53:24 ../../src/yb/tablet/mvcc.cc:398] T b9ed74f42cf04df396efdc6137080ae6 P 4565840f4da74933a568cda31d87a7b3: T b9ed74f42cf04df396efdc6137080ae6 P 4565840f4da74933a568cda31d87a7b3: Recent 32 MVCC operations:
1. AddPending { provided_ht: { physical: 1659869482751078 } final_ht: { physical: 1659869482751078 } }
2. AddPending { provided_ht: { physical: 1659869483082134 } final_ht: { physical: 1659869483082134 } }
3. AddPending { provided_ht: { physical: 1659869483171676 } final_ht: { physical: 1659869483171676 } }
4. AddPending { provided_ht: { physical: 1659869483846794 } final_ht: { physical: 1659869483846794 } }
5. AddPending { provided_ht: { physical: 1659869485760018 } final_ht: { physical: 1659869485760018 } }
6. AddPending { provided_ht: { physical: 1659869488630151 } final_ht: { physical: 1659869488630151 } }
7. AddPending { provided_ht: { physical: 1659869488657451 } final_ht: { physical: 1659869488657451 } }
8. AddPending { provided_ht: { physical: 1659869488899036 } final_ht: { physical: 1659869488899036 } }
9. AddPending { provided_ht: { physical: 1659869488928469 } final_ht: { physical: 1659869488928469 } }
10. AddPending { provided_ht: { physical: 1659869489409507 } final_ht: { physical: 1659869489409507 } }
11. AddPending { provided_ht: { physical: 1659869489409518 } final_ht: { physical: 1659869489409518 } }
12. AddPending { provided_ht: { physical: 1659869489643053 } final_ht: { physical: 1659869489643053 } }
13. AddPending { provided_ht: { physical: 1659869489643064 } final_ht: { physical: 1659869489643064 } }
14. AddPending { provided_ht: { physical: 1659869489785710 } final_ht: { physical: 1659869489785710 } }
15. AddPending { provided_ht: { physical: 1659869490018739 } final_ht: { physical: 1659869490018739 } }
16. AddPending { provided_ht: { physical: 1659869508007529 } final_ht: { physical: 1659869508007529 } }
17. AddPending { provided_ht: { physical: 1659869508427101 } final_ht: { physical: 1659869508427101 } }
18. AddPending { provided_ht: { physical: 1659869508685347 } final_ht: { physical: 1659869508685347 } }
19. AddPending { provided_ht: { physical: 1659869508884479 } final_ht: { physical: 1659869508884479 } }
20. AddPending { provided_ht: { physical: 1659869509765492 } final_ht: { physical: 1659869509765492 } }
21. AddPending { provided_ht: { physical: 1659869512093569 } final_ht: { physical: 1659869512093569 } }
22. AddPending { provided_ht: { physical: 1659869514263447 } final_ht: { physical: 1659869514263447 } }
23. AddPending { provided_ht: { physical: 1659869515241754 } final_ht: { physical: 1659869515241754 } }
24. AddPending { provided_ht: { physical: 1659869521264073 } final_ht: { physical: 1659869521264073 } }
25. AddPending { provided_ht: { physical: 1659869521264080 } final_ht: { physical: 1659869521264080 } }
26. AddPending { provided_ht: { physical: 1659869528324544 } final_ht: { physical: 1659869528324544 } }
27. AddPending { provided_ht: { physical: 1659869545620858 } final_ht: { physical: 1659869545620858 } }
28. AddPending { provided_ht: { physical: 1659869550060913 } final_ht: { physical: 1659869550060913 } }
29. AddPending { provided_ht: { physical: 1659869552225451 } final_ht: { physical: 1659869552225451 } }
30. AddPending { provided_ht: { physical: 1659869565717022 } final_ht: { physical: 1659869565717022 } }
31. AddPending { provided_ht: { physical: 1659869575388973 } final_ht: { physical: 1659869575388973 } }
32. AddPending { provided_ht: { physical: 1659869578962262 } final_ht: { physical: 1659869578962262 } }
New operation's hybrid time too low: { physical: 1659869578962262 }
max_safe_time_returned_with_lease_={ safe_time: { physical: 1659869433613145 logical: 4095 } source: kNextInQueue }
*ht < max_safe_time_returned_with_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=595349979137
ht->PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=145349117
max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
*ht < max_safe_time_returned_without_lease_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6798825795429425152
ht->PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1659869578962262
max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
*ht < max_safe_time_returned_for_follower_.safe_time=0
static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6798825795429425152
ht->PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1659869578962262
(SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown})={ safe_time: { physical: 1659869427635078 } source: kUnknown }
*ht < (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time=0
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time.ToUint64())=619836145664
ht->PhysicalDiff((SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time)=151327184
(SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown})={ safe_time: { physical: 1659869578962262 } source: kUnknown }
*ht < (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time=0
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=0
ht->PhysicalDiff((SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time)=0
is_follower_side=1
queue_.size()=136
queue_=[{ physical: 1659869441205059 }, { physical: 1659869441313426 }, { physical: 1659869441354807 }, { physical: 1659869441408976 }, { physical: 1659869441755459 }, { physical: 1659869441757999 }, { physical: 1659869441758006 }, { physical: 1659869441768249 }, { physical: 1659869441852956 }, { physical: 1659869441922293 }, { physical: 1659869441949596 }, { physical: 1659869441956667 }, { physical: 1659869441970522 }, { physical: 1659869442003832 }, { physical: 1659869442025827 }, { physical: 1659869442036875 }, { physical: 1659869442723039 }, { physical: 1659869442724927 }, { physical: 1659869442724942 }, { physical: 1659869442724944 }, { physical: 1659869442724947 }, { physical: 1659869442724948 }, { physical: 1659869442724951 }, { physical: 1659869442724953 }, { physical: 1659869442724956 }, { physical: 1659869442724959 }, { physical: 1659869442724962 }, { physical: 1659869442724964 }, { physical: 1659869442724969 }, { physical: 1659869442753873 }, { physical: 1659869442753878 }, { physical: 1659869442800523 }, { physical: 1659869442844887 }, { physical: 1659869443064144 }, { physical: 1659869443107400 }, { physical: 1659869443516899 }, { physical: 1659869443789991 }, { physical: 1659869443874314 }, { physical: 1659869444076513 }, { physical: 1659869444209794 }, { physical: 1659869444209799 }, { physical: 1659869444261645 }, { physical: 1659869445808870 }, { physical: 1659869445812110 }, { physical: 1659869445883082 }, { physical: 1659869445883090 }, { physical: 1659869445883094 }, { physical: 1659869445883098 }, { physical: 1659869445883109 }, { physical: 1659869446138755 }, { physical: 1659869446138762 }, { physical: 1659869446138769 }, { physical: 1659869446138772 }, { physical: 1659869446442032 }, { physical: 1659869446573606 }, { physical: 1659869446573623 }, { physical: 1659869446624111 }, { physical: 1659869446675021 }, { physical: 1659869446827586 }, { physical: 1659869446837270 }, { physical: 1659869447229895 }, { physical: 1659869449178604 }, { physical: 1659869449178606 }, { physical: 1659869449178609 }, { physical: 1659869449329995 }, { physical: 1659869449329999 }, { physical: 1659869449330005 }, { physical: 1659869449389951 }, { physical: 1659869449389958 }, { physical: 1659869449432977 }, { physical: 1659869451412428 }, { physical: 1659869451412436 }, { physical: 1659869451412440 }, { physical: 1659869451412445 }, { physical: 1659869451412454 }, { physical: 1659869451412458 logical: 1 }, { physical: 1659869451459272 }, { physical: 1659869451523763 }, { physical: 1659869451652013 }, { physical: 1659869451880641 }, { physical: 1659869454470780 }, { physical: 1659869454640213 }, { physical: 1659869454708605 }, { physical: 1659869454821290 }, { physical: 1659869454821302 }, { physical: 1659869456947111 }, { physical: 1659869456949968 }, { physical: 1659869457117331 }, { physical: 1659869460730933 }, { physical: 1659869460730944 }, { physical: 1659869460730951 }, { physical: 1659869462185691 }, { physical: 1659869462524689 }, { physical: 1659869462524693 }, { physical: 1659869462747429 }, { physical: 1659869475669087 }, { physical: 1659869475961035 }, { physical: 1659869476694691 }, { physical: 1659869477727113 }, { physical: 1659869478520856 }, { physical: 1659869478520864 }, { physical: 1659869479838688 }, { physical: 1659869481962259 }, { physical: 1659869482403517 }, { physical: 1659869482751078 }, { physical: 1659869483082134 }, { physical: 1659869483171676 }, { physical: 1659869483846794 }, { physical: 1659869485760018 }, { physical: 1659869488630151 }, { physical: 1659869488657451 }, { physical: 1659869488899036 }, { physical: 1659869488928469 }, { physical: 1659869489409507 }, { physical: 1659869489409518 }, { physical: 1659869489643053 }, { physical: 1659869489643064 }, { physical: 1659869489785710 }, { physical: 1659869490018739 }, { physical: 1659869508007529 }, { physical: 1659869508427101 }, { physical: 1659869508685347 }, { physical: 1659869508884479 }, { physical: 1659869509765492 }, { physical: 1659869512093569 }, { physical: 1659869514263447 }, { physical: 1659869515241754 }, { physical: 1659869521264073 }, { physical: 1659869521264080 }, { physical: 1659869528324544 }, { physical: 1659869545620858 }, { physical: 1659869550060913 }, { physical: 1659869552225451 }, { physical: 1659869565717022 }, { physical: 1659869575388973 }, { physical: 1659869578962262 }]
aborted=[]
@ 0x7f6e208fe12c yb::LogFatalHandlerSink::send()
@ 0x7f6e1fadf346 google::LogMessage::SendToLog()
@ 0x7f6e1fadc7aa google::LogMessage::Flush()
@ 0x7f6e1fadf879 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6e28f8a2de yb::tablet::MvccManager::AddPending()
@ 0x7f6e28f34f6b yb::tablet::TabletPeer::StartReplicaOperation()
@ 0x7f6e28ba5e11 yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
@ 0x7f6e28ba8a0d yb::consensus::RaftConsensus::EnqueuePreparesUnlocked()
@ 0x7f6e28bb6077 yb::consensus::RaftConsensus::UpdateReplica()
@ 0x7f6e28bb7083 yb::consensus::RaftConsensus::Update()
@ 0x7f6e29810f11 yb::tserver::ConsensusServiceImpl::UpdateConsensus()
@ 0x7f6e26568bca yb::consensus::ConsensusServiceIf::Handle()
@ 0x7f6e2214cbc9 yb::rpc::ServicePoolImpl::Handle()
@ 0x7f6e220f4574 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x7f6e22158818 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x7f6e2098fe6f yb::Thread::SuperviseThread()
@ 0x7f6e1b1bd694 start_thread
@ 0x7f6e1a8fa41d __clone
@ (nil) (unknown)
Seen on 2.12.9.1-b1 as well with a hard inserting workload:
[centos@hostname ~]$ sudo cat /home/yugabyte/tserver/logs/yb-tserver.FATAL.details.2022-08-11T17_58_29.pid7951.txt
F20220811 17:58:29 ../../src/yb/tablet/mvcc.cc:392] T 29dc41ef1bbf459596b10f9505814bbc P bcfadf5152d547afa34bf5efce83971f: T 29dc41ef1bbf459596b10f9505814bbc P bcfadf5152d547afa34bf5efce83971f: Recent 32 MVCC operations:
1. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240612554013 } lease: <min> } safe_time: { physical: 1660240601401805 } }
2. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240612610855 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240612610855 } }
3. LastReplicatedHybridTime { last_replicated: { physical: 1660240601401805 } }
4. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240612610885 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240612610885 } }
5. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240612616740 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240612616740 } }
6. AddLeaderPending { ht: { physical: 1660240618627556 } op_id: 27.667725 }
7. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240622872519 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240614554016 } }
8. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240624157380 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240614554016 } }
9. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240624157592 } lease: { physical: 1660240624872523 } } safe_time: { physical: 1660240618627555 logical: 4095 } }
10. AddLeaderPending { ht: { physical: 1660240626822898 } op_id: 27.667726 }
11. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240626830669 } lease: { physical: 1660240624872523 } } safe_time: { physical: 1660240618627555 logical: 4095 } }
12. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240632688026 } lease: { physical: 1660240624872523 } } safe_time: { physical: 1660240618627555 logical: 4095 } }
13. Aborted { ht: { physical: 1660240626822898 } op_id: 27.667726 }
14. Aborted { ht: { physical: 1660240618627556 } op_id: 27.667725 }
15. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240643482770 } lease: <min> } safe_time: { physical: 1660240618627555 logical: 4095 } }
16. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240643482807 } lease: <min> } safe_time: { physical: 1660240618627555 logical: 4095 } }
17. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240646353877 } lease: <min> } safe_time: { physical: 1660240618627555 logical: 4095 } }
18. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240658619962 } }
19. AddFollowerPending { ht: { physical: 1660240659037521 } op_id: 31.667726 }
20. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240659037520 logical: 4095 } }
21. AddFollowerPending { ht: { physical: 1660240659306059 } op_id: 31.667727 }
22. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240659037520 logical: 4095 } }
23. Replicated { ht: { physical: 1660240659037521 } op_id: 31.667726 }
24. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240659306058 logical: 4095 } }
25. Replicated { ht: { physical: 1660240659306059 } op_id: 31.667727 }
26. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240660429737 } }
27. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240677032034 } lease: <min> } safe_time: { physical: 1660240659306059 } }
28. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240677032065 } lease: <min> } safe_time: { physical: 1660240659306059 } }
29. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240677088560 } lease: <min> } safe_time: { physical: 1660240659306059 } }
30. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240681529826 } lease: <min> } safe_time: { physical: 1660240659306059 } }
31. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240682725484 } lease: <min> } safe_time: { physical: 1660240659306059 } }
32. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240688704545 } lease: { physical: 1660240683529847 } } safe_time: { physical: 1660240683529847 } }
New operation's hybrid time too low: { physical: 1660240664554983 }, op id: 31.667728
max_safe_time_returned_with_lease_={ safe_time: { physical: 1660240683529847 } source: kHybridTimeLease }
!!! ht <= max_safe_time_returned_with_lease_.safe_time=1
static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=-77721042944
ht.PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=-18974864
max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
ht <= max_safe_time_returned_without_lease_.safe_time=0
static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6800345762017210368
ht.PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1660240664554983
max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
ht <= max_safe_time_returned_for_follower_.safe_time=0
static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6800345762017210368
ht.PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1660240664554983
last_replicated_={ physical: 1660240659306059 }
ht <= last_replicated_=0
static_cast<int64_t>(ht.ToUint64() - last_replicated_.ToUint64())=21499592704
ht.PhysicalDiff(last_replicated_)=5248924
last_ht_in_queue=<min>
ht <= last_ht_in_queue=0
static_cast<int64_t>(ht.ToUint64() - last_ht_in_queue.ToUint64())=6800345762017210368
ht.PhysicalDiff(last_ht_in_queue)=1660240664554983
propagated_safe_time_={ physical: 1660240660429737 }
ht <= propagated_safe_time_=0
static_cast<int64_t>(ht.ToUint64() - propagated_safe_time_.ToUint64())=16897007616
ht.PhysicalDiff(propagated_safe_time_)=4125246
queue_.size()=0
queue_=[]
@ 0x7f6dcdb45a1c yb::LogFatalHandlerSink::send()
@ 0x7f6dccf43fde google::LogMessage::SendToLog()
@ 0x7f6dccf4116a google::LogMessage::Flush()
@ 0x7f6dccf44859 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6dd7dfe654 yb::tablet::MvccManager::AddPending()
@ 0x7f6dd7dfeded yb::tablet::MvccManager::AddFollowerPending()
@ 0x7f6dd7de1aee yb::tablet::Operation::AddedToFollower()
@ 0x7f6dd7de619e yb::tablet::OperationDriver::Init()
@ 0x7f6dd7d9e9eb yb::tablet::TabletPeer::NewOperationDriver()
@ 0x7f6dd7d9ece8 yb::tablet::TabletPeer::NewReplicaOperationDriver()
@ 0x7f6dd7d9efdd yb::tablet::TabletPeer::StartReplicaOperation()
@ 0x7f6dd79ece46 yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
@ 0x7f6dd79f184c yb::consensus::RaftConsensus::EnqueuePreparesUnlocked()
@ 0x7f6dd79fd3fb yb::consensus::RaftConsensus::UpdateReplica()
@ 0x7f6dd79fe71c yb::consensus::RaftConsensus::Update()
@ 0x7f6dd8be61cf yb::tserver::ConsensusServiceImpl::UpdateConsensus()
@ 0x7f6dd471c3b6 _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_9consensus18ConsensusRequestPBENS3_19ConsensusResponsePBEEEZZNS3_18ConsensusServiceIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENKUlSt10shared_ptrINS0_11InboundCallEEE_clESF_EUlPKS4_PS5_NS0_10RpcContextEE_EEDaSF_T0_
@ 0x7f6dd471c56d _ZNSt17_Function_handlerIFvSt10shared_ptrIN2yb3rpc11InboundCallEEEZNS1_9consensus18ConsensusServiceIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEEUlS4_E_E9_M_invokeERKSt9_Any_dataOS4_
@ 0x7f6dd4715a94 yb::consensus::ConsensusServiceIf::Handle()
@ 0x7f6dcea94b3e yb::rpc::ServicePoolImpl::Handle()
@ 0x7f6dcea36d44 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x7f6dceaa8978 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x7f6dcdbe8ef5 yb::Thread::SuperviseThread()
@ 0x7f6dc94c4694 start_thread
@ 0x7f6dc8c0141d __clone
@ (nil) (unknown)
Some failure traces provided by @def- from the same cluster:
- https://gist.githubusercontent.com/mbautin/54d442d91467284395d254a121aaea94/raw (same as posted above at https://github.com/yugabyte/yugabyte-db/issues/13548#issuecomment-1212326765 ).
- Problematic condition:
!!! ht <= max_safe_time_returned_with_lease_.safe_time=1
- Problematic condition:
- https://gist.githubusercontent.com/mbautin/133d66703a6eca81ad8869ee54ed0be3/raw
- Problematic condition:
static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=0
(no strict inequality, so no!!!
marker). An operation at the same hybrid time is repeatedly aborted and re-added, and then added two times in a row:AddPending { provided_ht: { physical: 1659815744075129 } final_ht: { physical: 1659815744075129 } }
- Problematic condition:
- https://gist.githubusercontent.com/mbautin/576bea07e59ea8cd45434e6c316fe7ce/raw
- Same issue as the last one.
Doesn't happen with enable_lease_revocation=false
gflag.
Slack thread - https://yugabyte.slack.com/archives/C4K1838GL/p1654194951287439
@def- btw, do you know if this was affecting 2.12, but not 2.8? would it be easy enough to try to repro on 2.8? Asking since I don't believe we backported disabling this to 2.8.
It required very high load and a few hours of running, I'm not sure how well it reproduces. Should I try on 2.8?
@def- if it's something you can leave running in the background, that would be very useful!
- if it crashes on 2.8, we would need to backport the disable there as well
- if it doesn't crash, we could use that as signal that something broke in between 2.8 and 2.12
Sure, will attempt it now against fresh 2.8 universe: https://portal.dev.yugabyte.com/universes/17807c94-fe3f-40bd-a1ab-6db5e0f60da4/tasks
I could not reproduce the issue on 2.8 in ~18 hours. I will keep it running for some more hours but then shut down the universe again in the interest of cost. Is there any reason why we expect this to be failing in 2.1 and 2.12, but not 2.8?
One difference I noticed is that this time the universe is all in the same region, while previously it was spread across 3 regions in the US. Could this make a difference? Also the previous universe, whether on 2.1 or 2.12, had been initially created and the tables created as well, on 2.1 version.
I increased the load of writer threads of CassandraBatchKeyValue and got it again: (I think it was when I started another run with the same table, overwriting existing entries with 800 threads. Existing runes were going for a few hours already and keeping load high on the universe.)
[yugabyte@yb-15-dfelsing-mvcc-n15 ~]$ cat /home/yugabyte/tserver/logs/yb-tserver.FATAL.details.2022-10-05T07_51_23.pid11733.txt
F20221005 07:51:23 ../../src/yb/tablet/mvcc.cc:391] T fecee17aa0024a8ea536ff35ef44c6be P 681b0cc202714b1ca8ab8e73112d4cd1: T fecee17aa0024a8ea536ff35ef44c6be P 681b0cc202714b1ca8ab8e73112d4cd1: Recent 32 MVCC operations:
1. Replicated { ht: { physical: 1664956141029253 } op_id: 2.13255 }
2. Replicated { ht: { physical: 1664956141063296 } op_id: 2.13256 }
3. Replicated { ht: { physical: 1664956141065800 } op_id: 2.13257 }
4. Replicated { ht: { physical: 1664956141070167 } op_id: 2.13258 }
5. Replicated { ht: { physical: 1664956141316200 } op_id: 2.13259 }
6. Replicated { ht: { physical: 1664956141317478 } op_id: 2.13260 }
7. Replicated { ht: { physical: 1664956141401706 } op_id: 2.13261 }
8. Replicated { ht: { physical: 1664956141441274 } op_id: 2.13262 }
9. Replicated { ht: { physical: 1664956141453682 } op_id: 2.13263 }
10. SetPropagatedSafeTimeOnFollower { ht: { physical: 1664956141861621 logical: 4095 } }
11. Replicated { ht: { physical: 1664956141861622 } op_id: 2.13264 }
12. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956167610195 } lease: <min> } safe_time: { physical: 1664956141861622 } }
13. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956167610195 logical: 1 } lease: <min> } safe_time: { physical: 1664956141861622 } }
14. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956167615674 } lease: <min> } safe_time: { physical: 1664956141861622 } }
15. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956171026981 } lease: <min> } safe_time: { physical: 1664956141861622 } }
16. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956171687013 } lease: <min> } safe_time: { physical: 1664956141861622 } }
17. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956175161422 } lease: <min> } safe_time: { physical: 1664956141861622 } }
18. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956175366198 } lease: <min> } safe_time: { physical: 1664956141861622 } }
19. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956176121772 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956176121772 } }
20. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956189733642 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956177366205 } }
21. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956191031210 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956177366205 } }
22. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956195822196 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956177366205 } }
23. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956243060510 } lease: <min> } safe_time: { physical: 1664956177366205 } }
24. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956243060531 } lease: <min> } safe_time: { physical: 1664956177366205 } }
25. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956243484315 } lease: <min> } safe_time: { physical: 1664956177366205 } }
26. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956247089893 } lease: <min> } safe_time: { physical: 1664956177366205 } }
27. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956251424342 } lease: <min> } safe_time: { physical: 1664956177366205 } }
28. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956253016515 } lease: <min> } safe_time: { physical: 1664956177366205 } }
29. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956256328317 } lease: { physical: 1664956253424351 } } safe_time: { physical: 1664956253424351 } }
30. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956256818762 } lease: { physical: 1664956253424351 } } safe_time: { physical: 1664956253424351 } }
31. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956271148750 } lease: { physical: 1664956258328324 } } safe_time: { physical: 1664956258328324 } }
32. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956272276034 } lease: { physical: 1664956258328324 } } safe_time: { physical: 1664956258328324 } }
New operation's hybrid time too low: { physical: 1664956243259606 }, op id: 6.13267
max_safe_time_returned_with_lease_={ safe_time: { physical: 1664956258328324 } source: kHybridTimeLease }
!!! ht <= max_safe_time_returned_with_lease_.safe_time=1
static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=-61721468928
ht.PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=-15068718
max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
ht <= max_safe_time_returned_without_lease_.safe_time=0
static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6819660772391346176
ht.PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1664956243259606
max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
ht <= max_safe_time_returned_for_follower_.safe_time=0
static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6819660772391346176
ht.PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1664956243259606
last_replicated_={ physical: 1664956141861622 }
ht <= last_replicated_=0
static_cast<int64_t>(ht.ToUint64() - last_replicated_.ToUint64())=415326142464
ht.PhysicalDiff(last_replicated_)=101397984
last_ht_in_queue=<min>
ht <= last_ht_in_queue=0
static_cast<int64_t>(ht.ToUint64() - last_ht_in_queue.ToUint64())=6819660772391346176
ht.PhysicalDiff(last_ht_in_queue)=1664956243259606
propagated_safe_time_={ physical: 1664956141861621 logical: 4095 }
ht <= propagated_safe_time_=0
static_cast<int64_t>(ht.ToUint64() - propagated_safe_time_.ToUint64())=415326142465
ht.PhysicalDiff(propagated_safe_time_)=101397985
queue_.size()=0
queue_=[]
@ 0x7f71df4acffc yb::LogFatalHandlerSink::send()
@ 0x7f71dcc318ee google::LogMessage::SendToLog()
@ 0x7f71dcc2ea7a google::LogMessage::Flush()
@ 0x7f71dcc32169 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f71e82bcc74 yb::tablet::MvccManager::AddPending()
@ 0x7f71e82bd40d yb::tablet::MvccManager::AddFollowerPending()
@ 0x7f71e829fc44 yb::tablet::Operation::AddedToFollower()
@ 0x7f71e82a40f1 yb::tablet::OperationDriver::Init()
@ 0x7f71e825bfcb yb::tablet::TabletPeer::NewOperationDriver()
@ 0x7f71e825c148 yb::tablet::TabletPeer::NewReplicaOperationDriver()
@ 0x7f71e825c421 yb::tablet::TabletPeer::StartReplicaOperation()
@ 0x7f71e7e8da46 yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
@ 0x7f71e7e91fcc yb::consensus::RaftConsensus::EnqueuePreparesUnlocked()
@ 0x7f71e7e9e4f1 yb::consensus::RaftConsensus::UpdateReplica()
@ 0x7f71e7e9f6ec yb::consensus::RaftConsensus::Update()
@ 0x7f71e8dfa5ae yb::tserver::ConsensusServiceImpl::UpdateConsensus()
@ 0x7f71e4db2f17 yb::consensus::ConsensusServiceIf::Handle()
@ 0x7f71df948689 yb::rpc::ServicePoolImpl::Handle()
@ 0x7f71df8ebaa4 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x7f71df95abf8 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x7f71df547bbf yb::Thread::SuperviseThread()
@ 0x7f71d97e9694 start_thread
@ 0x7f71d8f2641d __clone
@ (nil) (unknown)
Edit: I have deleted the universe now, can't pause it because of 4 x local disk.
Thanks @def- , my bad, so this is present across all releases then! cc @rthallamko3 @robertsami Should we then disable lease revocation for 2.8 as well?
I tried reproducing this locally on 1 node / 3 node but didn't manage: https://gist.github.com/def-/8f16c5f6966ac6136fd65c578b68c9c9 It was difficult even on 18 node to run into this and took a few hours, I'm not sure how to trigger it easier.
I haven't seen this again with @Huqicheng's fix: https://github.com/yugabyte/yugabyte-db/commit/a4d02ef6c65885e227eb1c61a4d8c81cbcf39d3c
So closing the bug as fixed.