yugabyte-db icon indicating copy to clipboard operation
yugabyte-db copied to clipboard

[DocDB] Failure seen on 2.1.10.0 trying to get to 300 GB SST size per node with Cassandra workloads

Open mbautin opened this issue 2 years ago • 1 comments

Jira Link: DB-3130

Description

Reported by @def-.

Failure seen on 2.1.10.0 trying to get to 300 GB SST size per node with Cassandra workloads:

[root@centosvm centos]# cat /home/yugabyte/tserver/logs/yb-tserver.FATAL.details.2022-08-07T10_53_24.pid25718.txt
F20220807 10:53:24 ../../src/yb/tablet/mvcc.cc:398] T b9ed74f42cf04df396efdc6137080ae6 P 4565840f4da74933a568cda31d87a7b3: T b9ed74f42cf04df396efdc6137080ae6 P 4565840f4da74933a568cda31d87a7b3: Recent 32 MVCC operations:
1. AddPending { provided_ht: { physical: 1659869482751078 } final_ht: { physical: 1659869482751078 } }
2. AddPending { provided_ht: { physical: 1659869483082134 } final_ht: { physical: 1659869483082134 } }
3. AddPending { provided_ht: { physical: 1659869483171676 } final_ht: { physical: 1659869483171676 } }
4. AddPending { provided_ht: { physical: 1659869483846794 } final_ht: { physical: 1659869483846794 } }
5. AddPending { provided_ht: { physical: 1659869485760018 } final_ht: { physical: 1659869485760018 } }
6. AddPending { provided_ht: { physical: 1659869488630151 } final_ht: { physical: 1659869488630151 } }
7. AddPending { provided_ht: { physical: 1659869488657451 } final_ht: { physical: 1659869488657451 } }
8. AddPending { provided_ht: { physical: 1659869488899036 } final_ht: { physical: 1659869488899036 } }
9. AddPending { provided_ht: { physical: 1659869488928469 } final_ht: { physical: 1659869488928469 } }
10. AddPending { provided_ht: { physical: 1659869489409507 } final_ht: { physical: 1659869489409507 } }
11. AddPending { provided_ht: { physical: 1659869489409518 } final_ht: { physical: 1659869489409518 } }
12. AddPending { provided_ht: { physical: 1659869489643053 } final_ht: { physical: 1659869489643053 } }
13. AddPending { provided_ht: { physical: 1659869489643064 } final_ht: { physical: 1659869489643064 } }
14. AddPending { provided_ht: { physical: 1659869489785710 } final_ht: { physical: 1659869489785710 } }
15. AddPending { provided_ht: { physical: 1659869490018739 } final_ht: { physical: 1659869490018739 } }
16. AddPending { provided_ht: { physical: 1659869508007529 } final_ht: { physical: 1659869508007529 } }
17. AddPending { provided_ht: { physical: 1659869508427101 } final_ht: { physical: 1659869508427101 } }
18. AddPending { provided_ht: { physical: 1659869508685347 } final_ht: { physical: 1659869508685347 } }
19. AddPending { provided_ht: { physical: 1659869508884479 } final_ht: { physical: 1659869508884479 } }
20. AddPending { provided_ht: { physical: 1659869509765492 } final_ht: { physical: 1659869509765492 } }
21. AddPending { provided_ht: { physical: 1659869512093569 } final_ht: { physical: 1659869512093569 } }
22. AddPending { provided_ht: { physical: 1659869514263447 } final_ht: { physical: 1659869514263447 } }
23. AddPending { provided_ht: { physical: 1659869515241754 } final_ht: { physical: 1659869515241754 } }
24. AddPending { provided_ht: { physical: 1659869521264073 } final_ht: { physical: 1659869521264073 } }
25. AddPending { provided_ht: { physical: 1659869521264080 } final_ht: { physical: 1659869521264080 } }
26. AddPending { provided_ht: { physical: 1659869528324544 } final_ht: { physical: 1659869528324544 } }
27. AddPending { provided_ht: { physical: 1659869545620858 } final_ht: { physical: 1659869545620858 } }
28. AddPending { provided_ht: { physical: 1659869550060913 } final_ht: { physical: 1659869550060913 } }
29. AddPending { provided_ht: { physical: 1659869552225451 } final_ht: { physical: 1659869552225451 } }
30. AddPending { provided_ht: { physical: 1659869565717022 } final_ht: { physical: 1659869565717022 } }
31. AddPending { provided_ht: { physical: 1659869575388973 } final_ht: { physical: 1659869575388973 } }
32. AddPending { provided_ht: { physical: 1659869578962262 } final_ht: { physical: 1659869578962262 } }
New operation's hybrid time too low: { physical: 1659869578962262 }
  max_safe_time_returned_with_lease_={ safe_time: { physical: 1659869433613145 logical: 4095 } source: kNextInQueue }
  *ht < max_safe_time_returned_with_lease_.safe_time=0
  static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=595349979137
  ht->PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=145349117

  max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
  *ht < max_safe_time_returned_without_lease_.safe_time=0
  static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6798825795429425152
  ht->PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1659869578962262

  max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
  *ht < max_safe_time_returned_for_follower_.safe_time=0
  static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6798825795429425152
  ht->PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1659869578962262

  (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown})={ safe_time: { physical: 1659869427635078 } source: kUnknown }
  *ht < (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time=0
  static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time.ToUint64())=619836145664
  ht->PhysicalDiff((SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time)=151327184

  (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown})={ safe_time: { physical: 1659869578962262 } source: kUnknown }
  *ht < (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time=0
  static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=0
  ht->PhysicalDiff((SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time)=0

  is_follower_side=1
  queue_.size()=136
  queue_=[{ physical: 1659869441205059 }, { physical: 1659869441313426 }, { physical: 1659869441354807 }, { physical: 1659869441408976 }, { physical: 1659869441755459 }, { physical: 1659869441757999 }, { physical: 1659869441758006 }, { physical: 1659869441768249 }, { physical: 1659869441852956 }, { physical: 1659869441922293 }, { physical: 1659869441949596 }, { physical: 1659869441956667 }, { physical: 1659869441970522 }, { physical: 1659869442003832 }, { physical: 1659869442025827 }, { physical: 1659869442036875 }, { physical: 1659869442723039 }, { physical: 1659869442724927 }, { physical: 1659869442724942 }, { physical: 1659869442724944 }, { physical: 1659869442724947 }, { physical: 1659869442724948 }, { physical: 1659869442724951 }, { physical: 1659869442724953 }, { physical: 1659869442724956 }, { physical: 1659869442724959 }, { physical: 1659869442724962 }, { physical: 1659869442724964 }, { physical: 1659869442724969 }, { physical: 1659869442753873 }, { physical: 1659869442753878 }, { physical: 1659869442800523 }, { physical: 1659869442844887 }, { physical: 1659869443064144 }, { physical: 1659869443107400 }, { physical: 1659869443516899 }, { physical: 1659869443789991 }, { physical: 1659869443874314 }, { physical: 1659869444076513 }, { physical: 1659869444209794 }, { physical: 1659869444209799 }, { physical: 1659869444261645 }, { physical: 1659869445808870 }, { physical: 1659869445812110 }, { physical: 1659869445883082 }, { physical: 1659869445883090 }, { physical: 1659869445883094 }, { physical: 1659869445883098 }, { physical: 1659869445883109 }, { physical: 1659869446138755 }, { physical: 1659869446138762 }, { physical: 1659869446138769 }, { physical: 1659869446138772 }, { physical: 1659869446442032 }, { physical: 1659869446573606 }, { physical: 1659869446573623 }, { physical: 1659869446624111 }, { physical: 1659869446675021 }, { physical: 1659869446827586 }, { physical: 1659869446837270 }, { physical: 1659869447229895 }, { physical: 1659869449178604 }, { physical: 1659869449178606 }, { physical: 1659869449178609 }, { physical: 1659869449329995 }, { physical: 1659869449329999 }, { physical: 1659869449330005 }, { physical: 1659869449389951 }, { physical: 1659869449389958 }, { physical: 1659869449432977 }, { physical: 1659869451412428 }, { physical: 1659869451412436 }, { physical: 1659869451412440 }, { physical: 1659869451412445 }, { physical: 1659869451412454 }, { physical: 1659869451412458 logical: 1 }, { physical: 1659869451459272 }, { physical: 1659869451523763 }, { physical: 1659869451652013 }, { physical: 1659869451880641 }, { physical: 1659869454470780 }, { physical: 1659869454640213 }, { physical: 1659869454708605 }, { physical: 1659869454821290 }, { physical: 1659869454821302 }, { physical: 1659869456947111 }, { physical: 1659869456949968 }, { physical: 1659869457117331 }, { physical: 1659869460730933 }, { physical: 1659869460730944 }, { physical: 1659869460730951 }, { physical: 1659869462185691 }, { physical: 1659869462524689 }, { physical: 1659869462524693 }, { physical: 1659869462747429 }, { physical: 1659869475669087 }, { physical: 1659869475961035 }, { physical: 1659869476694691 }, { physical: 1659869477727113 }, { physical: 1659869478520856 }, { physical: 1659869478520864 }, { physical: 1659869479838688 }, { physical: 1659869481962259 }, { physical: 1659869482403517 }, { physical: 1659869482751078 }, { physical: 1659869483082134 }, { physical: 1659869483171676 }, { physical: 1659869483846794 }, { physical: 1659869485760018 }, { physical: 1659869488630151 }, { physical: 1659869488657451 }, { physical: 1659869488899036 }, { physical: 1659869488928469 }, { physical: 1659869489409507 }, { physical: 1659869489409518 }, { physical: 1659869489643053 }, { physical: 1659869489643064 }, { physical: 1659869489785710 }, { physical: 1659869490018739 }, { physical: 1659869508007529 }, { physical: 1659869508427101 }, { physical: 1659869508685347 }, { physical: 1659869508884479 }, { physical: 1659869509765492 }, { physical: 1659869512093569 }, { physical: 1659869514263447 }, { physical: 1659869515241754 }, { physical: 1659869521264073 }, { physical: 1659869521264080 }, { physical: 1659869528324544 }, { physical: 1659869545620858 }, { physical: 1659869550060913 }, { physical: 1659869552225451 }, { physical: 1659869565717022 }, { physical: 1659869575388973 }, { physical: 1659869578962262 }]
  aborted=[]
    @     0x7f6e208fe12c  yb::LogFatalHandlerSink::send()
    @     0x7f6e1fadf346  google::LogMessage::SendToLog()
    @     0x7f6e1fadc7aa  google::LogMessage::Flush()
    @     0x7f6e1fadf879  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6e28f8a2de  yb::tablet::MvccManager::AddPending()
    @     0x7f6e28f34f6b  yb::tablet::TabletPeer::StartReplicaOperation()
    @     0x7f6e28ba5e11  yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
    @     0x7f6e28ba8a0d  yb::consensus::RaftConsensus::EnqueuePreparesUnlocked()
    @     0x7f6e28bb6077  yb::consensus::RaftConsensus::UpdateReplica()
    @     0x7f6e28bb7083  yb::consensus::RaftConsensus::Update()
    @     0x7f6e29810f11  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0x7f6e26568bca  yb::consensus::ConsensusServiceIf::Handle()
    @     0x7f6e2214cbc9  yb::rpc::ServicePoolImpl::Handle()
    @     0x7f6e220f4574  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7f6e22158818  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f6e2098fe6f  yb::Thread::SuperviseThread()
    @     0x7f6e1b1bd694  start_thread
    @     0x7f6e1a8fa41d  __clone
    @              (nil)  (unknown)

mbautin avatar Aug 08 '22 18:08 mbautin

Seen on 2.12.9.1-b1 as well with a hard inserting workload:

[centos@hostname ~]$ sudo cat /home/yugabyte/tserver/logs/yb-tserver.FATAL.details.2022-08-11T17_58_29.pid7951.txt
F20220811 17:58:29 ../../src/yb/tablet/mvcc.cc:392] T 29dc41ef1bbf459596b10f9505814bbc P bcfadf5152d547afa34bf5efce83971f: T 29dc41ef1bbf459596b10f9505814bbc P bcfadf5152d547afa34bf5efce83971f: Recent 32 MVCC operations:
1. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240612554013 } lease: <min> } safe_time: { physical: 1660240601401805 } }
2. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240612610855 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240612610855 } }
3. LastReplicatedHybridTime { last_replicated: { physical: 1660240601401805 } }
4. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240612610885 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240612610885 } }
5. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240612616740 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240612616740 } }
6. AddLeaderPending { ht: { physical: 1660240618627556 } op_id: 27.667725 }
7. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240622872519 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240614554016 } }
8. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240624157380 } lease: { physical: 1660240614554016 } } safe_time: { physical: 1660240614554016 } }
9. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240624157592 } lease: { physical: 1660240624872523 } } safe_time: { physical: 1660240618627555 logical: 4095 } }
10. AddLeaderPending { ht: { physical: 1660240626822898 } op_id: 27.667726 }
11. UpdatePropagatedSafeTimeOnLeader { ht_lease: { time: { physical: 1660240626830669 } lease: { physical: 1660240624872523 } } safe_time: { physical: 1660240618627555 logical: 4095 } }
12. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240632688026 } lease: { physical: 1660240624872523 } } safe_time: { physical: 1660240618627555 logical: 4095 } }
13. Aborted { ht: { physical: 1660240626822898 } op_id: 27.667726 }
14. Aborted { ht: { physical: 1660240618627556 } op_id: 27.667725 }
15. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240643482770 } lease: <min> } safe_time: { physical: 1660240618627555 logical: 4095 } }
16. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240643482807 } lease: <min> } safe_time: { physical: 1660240618627555 logical: 4095 } }
17. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240646353877 } lease: <min> } safe_time: { physical: 1660240618627555 logical: 4095 } }
18. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240658619962 } }
19. AddFollowerPending { ht: { physical: 1660240659037521 } op_id: 31.667726 }
20. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240659037520 logical: 4095 } }
21. AddFollowerPending { ht: { physical: 1660240659306059 } op_id: 31.667727 }
22. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240659037520 logical: 4095 } }
23. Replicated { ht: { physical: 1660240659037521 } op_id: 31.667726 }
24. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240659306058 logical: 4095 } }
25. Replicated { ht: { physical: 1660240659306059 } op_id: 31.667727 }
26. SetPropagatedSafeTimeOnFollower { ht: { physical: 1660240660429737 } }
27. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240677032034 } lease: <min> } safe_time: { physical: 1660240659306059 } }
28. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240677032065 } lease: <min> } safe_time: { physical: 1660240659306059 } }
29. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240677088560 } lease: <min> } safe_time: { physical: 1660240659306059 } }
30. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240681529826 } lease: <min> } safe_time: { physical: 1660240659306059 } }
31. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240682725484 } lease: <min> } safe_time: { physical: 1660240659306059 } }
32. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1660240688704545 } lease: { physical: 1660240683529847 } } safe_time: { physical: 1660240683529847 } }
New operation's hybrid time too low: { physical: 1660240664554983 }, op id: 31.667728
  max_safe_time_returned_with_lease_={ safe_time: { physical: 1660240683529847 } source: kHybridTimeLease }
  !!! ht <= max_safe_time_returned_with_lease_.safe_time=1
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=-77721042944
  ht.PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=-18974864

  max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
  ht <= max_safe_time_returned_without_lease_.safe_time=0
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6800345762017210368
  ht.PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1660240664554983

  max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
  ht <= max_safe_time_returned_for_follower_.safe_time=0
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6800345762017210368
  ht.PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1660240664554983

  last_replicated_={ physical: 1660240659306059 }
  ht <= last_replicated_=0
  static_cast<int64_t>(ht.ToUint64() - last_replicated_.ToUint64())=21499592704
  ht.PhysicalDiff(last_replicated_)=5248924

  last_ht_in_queue=<min>
  ht <= last_ht_in_queue=0
  static_cast<int64_t>(ht.ToUint64() - last_ht_in_queue.ToUint64())=6800345762017210368
  ht.PhysicalDiff(last_ht_in_queue)=1660240664554983

  propagated_safe_time_={ physical: 1660240660429737 }
  ht <= propagated_safe_time_=0
  static_cast<int64_t>(ht.ToUint64() - propagated_safe_time_.ToUint64())=16897007616
  ht.PhysicalDiff(propagated_safe_time_)=4125246

  queue_.size()=0
  queue_=[]
    @     0x7f6dcdb45a1c  yb::LogFatalHandlerSink::send()
    @     0x7f6dccf43fde  google::LogMessage::SendToLog()
    @     0x7f6dccf4116a  google::LogMessage::Flush()
    @     0x7f6dccf44859  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6dd7dfe654  yb::tablet::MvccManager::AddPending()
    @     0x7f6dd7dfeded  yb::tablet::MvccManager::AddFollowerPending()
    @     0x7f6dd7de1aee  yb::tablet::Operation::AddedToFollower()
    @     0x7f6dd7de619e  yb::tablet::OperationDriver::Init()
    @     0x7f6dd7d9e9eb  yb::tablet::TabletPeer::NewOperationDriver()
    @     0x7f6dd7d9ece8  yb::tablet::TabletPeer::NewReplicaOperationDriver()
    @     0x7f6dd7d9efdd  yb::tablet::TabletPeer::StartReplicaOperation()
    @     0x7f6dd79ece46  yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
    @     0x7f6dd79f184c  yb::consensus::RaftConsensus::EnqueuePreparesUnlocked()
    @     0x7f6dd79fd3fb  yb::consensus::RaftConsensus::UpdateReplica()
    @     0x7f6dd79fe71c  yb::consensus::RaftConsensus::Update()
    @     0x7f6dd8be61cf  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0x7f6dd471c3b6  _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_9consensus18ConsensusRequestPBENS3_19ConsensusResponsePBEEEZZNS3_18ConsensusServiceIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENKUlSt10shared_ptrINS0_11InboundCallEEE_clESF_EUlPKS4_PS5_NS0_10RpcContextEE_EEDaSF_T0_
    @     0x7f6dd471c56d  _ZNSt17_Function_handlerIFvSt10shared_ptrIN2yb3rpc11InboundCallEEEZNS1_9consensus18ConsensusServiceIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEEUlS4_E_E9_M_invokeERKSt9_Any_dataOS4_
    @     0x7f6dd4715a94  yb::consensus::ConsensusServiceIf::Handle()
    @     0x7f6dcea94b3e  yb::rpc::ServicePoolImpl::Handle()
    @     0x7f6dcea36d44  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7f6dceaa8978  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f6dcdbe8ef5  yb::Thread::SuperviseThread()
    @     0x7f6dc94c4694  start_thread
    @     0x7f6dc8c0141d  __clone
    @              (nil)  (unknown)

def- avatar Aug 11 '22 18:08 def-

Some failure traces provided by @def- from the same cluster:

  • https://gist.githubusercontent.com/mbautin/54d442d91467284395d254a121aaea94/raw (same as posted above at https://github.com/yugabyte/yugabyte-db/issues/13548#issuecomment-1212326765 ).
    • Problematic condition: !!! ht <= max_safe_time_returned_with_lease_.safe_time=1
  • https://gist.githubusercontent.com/mbautin/133d66703a6eca81ad8869ee54ed0be3/raw
    • Problematic condition: static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=0 (no strict inequality, so no !!! marker). An operation at the same hybrid time is repeatedly aborted and re-added, and then added two times in a row: AddPending { provided_ht: { physical: 1659815744075129 } final_ht: { physical: 1659815744075129 } }
  • https://gist.githubusercontent.com/mbautin/576bea07e59ea8cd45434e6c316fe7ce/raw
    • Same issue as the last one.

mbautin avatar Aug 15 '22 22:08 mbautin

Doesn't happen with enable_lease_revocation=false gflag.

def- avatar Aug 16 '22 17:08 def-

Slack thread - https://yugabyte.slack.com/archives/C4K1838GL/p1654194951287439

rthallamko3 avatar Sep 27 '22 16:09 rthallamko3

@def- btw, do you know if this was affecting 2.12, but not 2.8? would it be easy enough to try to repro on 2.8? Asking since I don't believe we backported disabling this to 2.8.

bmatican avatar Sep 27 '22 16:09 bmatican

It required very high load and a few hours of running, I'm not sure how well it reproduces. Should I try on 2.8?

def- avatar Sep 27 '22 16:09 def-

@def- if it's something you can leave running in the background, that would be very useful!

  • if it crashes on 2.8, we would need to backport the disable there as well
  • if it doesn't crash, we could use that as signal that something broke in between 2.8 and 2.12

bmatican avatar Sep 27 '22 16:09 bmatican

Sure, will attempt it now against fresh 2.8 universe: https://portal.dev.yugabyte.com/universes/17807c94-fe3f-40bd-a1ab-6db5e0f60da4/tasks

def- avatar Oct 04 '22 13:10 def-

I could not reproduce the issue on 2.8 in ~18 hours. I will keep it running for some more hours but then shut down the universe again in the interest of cost. Is there any reason why we expect this to be failing in 2.1 and 2.12, but not 2.8?

One difference I noticed is that this time the universe is all in the same region, while previously it was spread across 3 regions in the US. Could this make a difference? Also the previous universe, whether on 2.1 or 2.12, had been initially created and the tables created as well, on 2.1 version.

def- avatar Oct 05 '22 07:10 def-

I increased the load of writer threads of CassandraBatchKeyValue and got it again: (I think it was when I started another run with the same table, overwriting existing entries with 800 threads. Existing runes were going for a few hours already and keeping load high on the universe.)

[yugabyte@yb-15-dfelsing-mvcc-n15 ~]$ cat /home/yugabyte/tserver/logs/yb-tserver.FATAL.details.2022-10-05T07_51_23.pid11733.txt
F20221005 07:51:23 ../../src/yb/tablet/mvcc.cc:391] T fecee17aa0024a8ea536ff35ef44c6be P 681b0cc202714b1ca8ab8e73112d4cd1: T fecee17aa0024a8ea536ff35ef44c6be P 681b0cc202714b1ca8ab8e73112d4cd1: Recent 32 MVCC operations:
1. Replicated { ht: { physical: 1664956141029253 } op_id: 2.13255 }
2. Replicated { ht: { physical: 1664956141063296 } op_id: 2.13256 }
3. Replicated { ht: { physical: 1664956141065800 } op_id: 2.13257 }
4. Replicated { ht: { physical: 1664956141070167 } op_id: 2.13258 }
5. Replicated { ht: { physical: 1664956141316200 } op_id: 2.13259 }
6. Replicated { ht: { physical: 1664956141317478 } op_id: 2.13260 }
7. Replicated { ht: { physical: 1664956141401706 } op_id: 2.13261 }
8. Replicated { ht: { physical: 1664956141441274 } op_id: 2.13262 }
9. Replicated { ht: { physical: 1664956141453682 } op_id: 2.13263 }
10. SetPropagatedSafeTimeOnFollower { ht: { physical: 1664956141861621 logical: 4095 } }
11. Replicated { ht: { physical: 1664956141861622 } op_id: 2.13264 }
12. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956167610195 } lease: <min> } safe_time: { physical: 1664956141861622 } }
13. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956167610195 logical: 1 } lease: <min> } safe_time: { physical: 1664956141861622 } }
14. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956167615674 } lease: <min> } safe_time: { physical: 1664956141861622 } }
15. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956171026981 } lease: <min> } safe_time: { physical: 1664956141861622 } }
16. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956171687013 } lease: <min> } safe_time: { physical: 1664956141861622 } }
17. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956175161422 } lease: <min> } safe_time: { physical: 1664956141861622 } }
18. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956175366198 } lease: <min> } safe_time: { physical: 1664956141861622 } }
19. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956176121772 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956176121772 } }
20. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956189733642 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956177366205 } }
21. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956191031210 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956177366205 } }
22. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956195822196 } lease: { physical: 1664956177366205 } } safe_time: { physical: 1664956177366205 } }
23. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956243060510 } lease: <min> } safe_time: { physical: 1664956177366205 } }
24. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956243060531 } lease: <min> } safe_time: { physical: 1664956177366205 } }
25. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956243484315 } lease: <min> } safe_time: { physical: 1664956177366205 } }
26. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956247089893 } lease: <min> } safe_time: { physical: 1664956177366205 } }
27. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956251424342 } lease: <min> } safe_time: { physical: 1664956177366205 } }
28. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956253016515 } lease: <min> } safe_time: { physical: 1664956177366205 } }
29. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956256328317 } lease: { physical: 1664956253424351 } } safe_time: { physical: 1664956253424351 } }
30. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956256818762 } lease: { physical: 1664956253424351 } } safe_time: { physical: 1664956253424351 } }
31. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956271148750 } lease: { physical: 1664956258328324 } } safe_time: { physical: 1664956258328324 } }
32. SafeTime { min_allowed: <min> deadline: 9223372036.855s ht_lease: { time: { physical: 1664956272276034 } lease: { physical: 1664956258328324 } } safe_time: { physical: 1664956258328324 } }
New operation's hybrid time too low: { physical: 1664956243259606 }, op id: 6.13267
  max_safe_time_returned_with_lease_={ safe_time: { physical: 1664956258328324 } source: kHybridTimeLease }
  !!! ht <= max_safe_time_returned_with_lease_.safe_time=1
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=-61721468928
  ht.PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=-15068718

  max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
  ht <= max_safe_time_returned_without_lease_.safe_time=0
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6819660772391346176
  ht.PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1664956243259606

  max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
  ht <= max_safe_time_returned_for_follower_.safe_time=0
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6819660772391346176
  ht.PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1664956243259606

  last_replicated_={ physical: 1664956141861622 }
  ht <= last_replicated_=0
  static_cast<int64_t>(ht.ToUint64() - last_replicated_.ToUint64())=415326142464
  ht.PhysicalDiff(last_replicated_)=101397984

  last_ht_in_queue=<min>
  ht <= last_ht_in_queue=0
  static_cast<int64_t>(ht.ToUint64() - last_ht_in_queue.ToUint64())=6819660772391346176
  ht.PhysicalDiff(last_ht_in_queue)=1664956243259606

  propagated_safe_time_={ physical: 1664956141861621 logical: 4095 }
  ht <= propagated_safe_time_=0
  static_cast<int64_t>(ht.ToUint64() - propagated_safe_time_.ToUint64())=415326142465
  ht.PhysicalDiff(propagated_safe_time_)=101397985

  queue_.size()=0
  queue_=[]
    @     0x7f71df4acffc  yb::LogFatalHandlerSink::send()
    @     0x7f71dcc318ee  google::LogMessage::SendToLog()
    @     0x7f71dcc2ea7a  google::LogMessage::Flush()
    @     0x7f71dcc32169  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f71e82bcc74  yb::tablet::MvccManager::AddPending()
    @     0x7f71e82bd40d  yb::tablet::MvccManager::AddFollowerPending()
    @     0x7f71e829fc44  yb::tablet::Operation::AddedToFollower()
    @     0x7f71e82a40f1  yb::tablet::OperationDriver::Init()
    @     0x7f71e825bfcb  yb::tablet::TabletPeer::NewOperationDriver()
    @     0x7f71e825c148  yb::tablet::TabletPeer::NewReplicaOperationDriver()
    @     0x7f71e825c421  yb::tablet::TabletPeer::StartReplicaOperation()
    @     0x7f71e7e8da46  yb::consensus::RaftConsensus::StartReplicaOperationUnlocked()
    @     0x7f71e7e91fcc  yb::consensus::RaftConsensus::EnqueuePreparesUnlocked()
    @     0x7f71e7e9e4f1  yb::consensus::RaftConsensus::UpdateReplica()
    @     0x7f71e7e9f6ec  yb::consensus::RaftConsensus::Update()
    @     0x7f71e8dfa5ae  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0x7f71e4db2f17  yb::consensus::ConsensusServiceIf::Handle()
    @     0x7f71df948689  yb::rpc::ServicePoolImpl::Handle()
    @     0x7f71df8ebaa4  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7f71df95abf8  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f71df547bbf  yb::Thread::SuperviseThread()
    @     0x7f71d97e9694  start_thread
    @     0x7f71d8f2641d  __clone
    @              (nil)  (unknown)

Edit: I have deleted the universe now, can't pause it because of 4 x local disk.

def- avatar Oct 05 '22 08:10 def-

Thanks @def- , my bad, so this is present across all releases then! cc @rthallamko3 @robertsami Should we then disable lease revocation for 2.8 as well?

bmatican avatar Oct 05 '22 15:10 bmatican

I tried reproducing this locally on 1 node / 3 node but didn't manage: https://gist.github.com/def-/8f16c5f6966ac6136fd65c578b68c9c9 It was difficult even on 18 node to run into this and took a few hours, I'm not sure how to trigger it easier.

def- avatar Oct 27 '22 12:10 def-

I haven't seen this again with @Huqicheng's fix: https://github.com/yugabyte/yugabyte-db/commit/a4d02ef6c65885e227eb1c61a4d8c81cbcf39d3c

So closing the bug as fixed.

def- avatar Nov 17 '22 09:11 def-