yugabyte-db
[DocDB] Colocated table + Packed Rows: DML Workload and compaction cannot find schema packing and fail.
Jira Link: DB-10146
Description
Tried on version: 2.21.1.0-b158
Encountering the following fatal error again on 2.21.1.0-b158 (this build has Jonathan's fix for https://github.com/yugabyte/yugabyte-db/issues/20638) in a new Cross-DB DDL test. Note: I believe this time it is occurring on a table named 'tb_0_temp_old', which is likely created internally during the execution of certain DDLs.
F20240227 13:30:07 ../../src/yb/tablet/tablet.cc:1522] T f9ecba67fd0d44c78ef87b691e6647fc P d79ad410277a49fa85364466087c2ea2: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/tablet/tablet_metadata.cc:354): Cannot find packing for table: 00004022000030008000000000004223, schema version: 0
@ 0x55debc004907 google::LogMessage::SendToLog()
@ 0x55debc00583d google::LogMessage::Flush()
@ 0x55debc005e89 google::LogMessageFatal::~LogMessageFatal()
@ 0x55debd3b9b73 yb::tablet::Tablet::WriteToRocksDB()
@ 0x55debd3b5b45 yb::tablet::Tablet::ApplyIntents()
@ 0x55debd3b6712 yb::tablet::Tablet::ApplyIntents()
@ 0x55debd46e416 yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
@ 0x55debd38fffc yb::tablet::UpdateTxnOperation::DoReplicated()
@ 0x55debd3835fe yb::tablet::Operation::Replicated()
@ 0x55debd385a7f yb::tablet::OperationDriver::ReplicationFinished()
@ 0x55debc4a7e2b yb::consensus::ConsensusRound::NotifyReplicationFinished()
@ 0x55debc4f68ff yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
@ 0x55debc4f5c69 yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
@ 0x55debc4ddae0 yb::consensus::RaftConsensus::UpdateReplica()
@ 0x55debc4bf373 yb::consensus::RaftConsensus::Update()
@ 0x55debd6bf323 yb::tserver::ConsensusServiceImpl::UpdateConsensus()
@ 0x55debc54d5fe std::__1::__function::__func<>::operator()()
@ 0x55debc54e22f yb::consensus::ConsensusServiceIf::Handle()
@ 0x55debd2db56f yb::rpc::ServicePoolImpl::Handle()
@ 0x55debd1fbbbf yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x55debd2eb3f3 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x55debdae4913 yb::Thread::SuperviseThread()
@ 0x7fda53c551ca start_thread
@ 0x7fda53ea6e73 __GI___clone
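For triage, the table ID and schema version can be pulled out of messages like the one above. This is a minimal sketch, assuming the message text keeps the exact shape quoted in this issue; `PACKING_ERROR` and `find_missing_packings` are hypothetical names, not part of any YugabyteDB tooling.

```python
import re

# Matches the "Cannot find packing" message as quoted in this issue.
# Assumption: table IDs are 32 lowercase hex characters, as in the logs here.
PACKING_ERROR = re.compile(
    r"Cannot find packing for table: (?P<table_id>[0-9a-f]{32}), "
    r"schema version: (?P<version>\d+)"
)


def find_missing_packings(lines):
    """Return (table_id, schema_version) for every line that reports the error."""
    found = []
    for line in lines:
        match = PACKING_ERROR.search(line)
        if match:
            found.append((match.group("table_id"), int(match.group("version"))))
    return found


sample = [
    "F20240227 13:30:07 ../../src/yb/tablet/tablet.cc:1522] "
    "T f9ecba67fd0d44c78ef87b691e6647fc P d79ad410277a49fa85364466087c2ea2: "
    "Failed to write a batch with 0 operations into RocksDB: "
    "Corruption (yb/tablet/tablet_metadata.cc:354): "
    "Cannot find packing for table: 00004022000030008000000000004223, "
    "schema version: 0",
]
print(find_missing_packings(sample))
```

Running the same function over the tserver logs would also pick up the compaction variant of the error, since both share the same message suffix.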
Test Details:
This occurred in the 2nd iteration of Step 3, so a backup and restore had already been executed on one database (postgres_20).
1. Start the cross-DB DDL workload, which executes DDLs and DMLs across databases concurrently (20 colocated databases and 20 non-colocated databases), and run it for 20-30 mins
2. Create a PITR schedule on 10 random databases
3. Start a while loop which executes:
a. Note down time for PITR(0)
b. Create a backup of 1 random database
c. Start the cross DB DDL workload and stop it after 10 mins
d. Note down the time for PITR(1)
e. Start the cross DB DDL workload and keep it running
f. Execute PITR on all 10 databases at random times (between 1-9 sec ago) while the workload is running.
g. Wait for the workload to stop
h. Restore to PITR(1)
i. Validate data
j. Restore to PITR(0) with a probability of 0.6 and validate data
k. Delete the PITR schedule for the backup db (In our case it was postgres_20)
l. Drop the database
m. Restore the backup
n. Create the snapshot schedule for this new DB
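The two randomized choices in the loop (step f's restore time and step j's 0.6 probability) could be sketched as follows. This is only an illustration: `pick_restore_time` and `should_restore_to_pitr0` are hypothetical helpers, and drawing the offset uniformly from a continuous 1-9 second range is an assumption, since the steps do not say how the time is chosen.

```python
import random
import time


def pick_restore_time(now, rng):
    """Pick a restore timestamp between 1 and 9 seconds before `now` (step f)."""
    # Assumption: a uniform, continuous offset; the test may use whole seconds.
    return now - rng.uniform(1.0, 9.0)


def should_restore_to_pitr0(rng, probability=0.6):
    """Decide whether to also restore to PITR(0) (step j)."""
    return rng.random() < probability


rng = random.Random(42)  # fixed seed so a failing run can be replayed
now = time.time()
restore_at = pick_restore_time(now, rng)
print(f"restore to {now - restore_at:.2f}s ago, "
      f"also restore to PITR(0): {should_restore_to_pitr0(rng)}")
```

Seeding the generator is a design choice for reproducibility; nothing in the issue indicates whether the actual test does this.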
List of DDLs in sample app
private static List<List<String>> ddlList = List.of(
List.of("CREATE INDEX idx1 ON ? (k)", "DROP INDEX idx1"),
List.of("CREATE TABLE tempTable1 AS SELECT * FROM ? limit 1000000", "ALTER TABLE tempTable1 RENAME TO tempTable1_new", "DROP TABLE tempTable1_new"),
List.of("CREATE MATERIALIZED VIEW mv1 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv1", "DROP MATERIALIZED VIEW mv1"),
List.of("ALTER TABLE ? ADD newColumn1 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? DROP newColumn1"),
List.of("ALTER TABLE ? ADD newColumn2 TEXT NULL", "ALTER TABLE ? DROP newColumn2"),
List.of("CREATE VIEW view1_? AS SELECT k from ?", "DROP VIEW view1_?"),
List.of("ALTER TABLE ? ADD newColumn3 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? ALTER newColumn3 TYPE VARCHAR(1000)", "ALTER TABLE ? DROP newColumn3"),
List.of("CREATE TABLE tempTable2 AS SELECT * FROM ? limit 1000000", "CREATE INDEX idx2 ON tempTable2(k)", "ALTER TABLE ? ADD newColumn4 TEXT DEFAULT 'dummyString'", "ALTER TABLE tempTable2 ADD newColumn2 TEXT DEFAULT 'dummyString'", "TRUNCATE table ? cascade", "ALTER TABLE ? DROP newColumn4", "ALTER TABLE tempTable2 DROP newColumn2", "DROP INDEX idx2", "DROP TABLE tempTable2"),
List.of("CREATE VIEW view2_? AS SELECT k from ?", "CREATE MATERIALIZED VIEW mv2 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv2", "DROP MATERIALIZED VIEW mv2", "DROP VIEW view2_?")
);
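Each inner list above is a group of statements run in order, with `?` standing in for a table name. A minimal sketch of that expansion, assuming every `?` in a statement is replaced by the same table name (the sample app's actual substitution logic is not shown in this issue; `expand_ddl` and the table name `tb_0` are hypothetical):

```python
def expand_ddl(template, table_name):
    """Replace every '?' placeholder in a DDL template with the table name."""
    return template.replace("?", table_name)


# One group from the list above: create a view, then drop it.
group = ["CREATE VIEW view1_? AS SELECT k from ?", "DROP VIEW view1_?"]
for statement in group:
    print(expand_ddl(statement, "tb_0"))
```

With this reading, view names such as `view1_?` become unique per table, while fixed names such as `tempTable1` and `mv1` are shared by every table in a database.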
Logs: http://stress.dev.yugabyte.com/stress_test/e56e3d49-de37-4a93-91d8-fe7493f6c4d1 (Attachments -> Universe logs)
G-flags
tserver_gflags={
"ysql_enable_packed_row": "true",
"ysql_enable_packed_row_for_colocated_table": "true",
"enable_automatic_tablet_splitting": "true",
"ysql_max_connections": "500",
'client_read_write_timeout_ms': str(30 * 60 * 1000),
'yb_client_admin_operation_timeout_sec': str(30 * 60),
"consistent_restore": "true",
"ysql_enable_db_catalog_version_mode": "true",
"allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
"tablet_replicas_per_gib_limit": 0
},
master_gflags={
"ysql_enable_packed_row": "true",
"ysql_enable_packed_row_for_colocated_table": "true",
"enable_automatic_tablet_splitting": "true",
"tablet_split_high_phase_shard_count_per_node": 20000,
"tablet_split_high_phase_size_threshold_bytes": 2097152, # 2MB
# low_phase_size 100KB
"tablet_split_low_phase_size_threshold_bytes": 102400, # 100 KB
"tablet_split_low_phase_shard_count_per_node": 10000,
"consistent_restore": "true",
"ysql_enable_db_catalog_version_mode": "true",
"allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
"tablet_replicas_per_gib_limit": 0
}
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- [X] I confirm this issue does not contain any sensitive information.
Similar issue https://github.com/yugabyte/yugabyte-db/issues/20638
yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133202.273248.gz:I0227 13:34:04.453986 273608 doc_read_context.cc:64] TBL 0000401b000030008000000000004226 T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2: DocReadContext, copy and filter: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33] => [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], min_schema_version: 21
...
yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133202.273248.gz:I0227 13:34:08.325945 277681 doc_read_context.cc:78] TBL 0000401b000030008000000000004226 T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2: LogAfterMerge: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], overwrite: 0
yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133202.273248.gz:W0227 13:34:08.771888 273608 db_impl.cc:3817] T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2 [R]: Compaction error: Corruption (yb/tablet/tablet_metadata.cc:354): Cannot find packing for table: 0000401b000030008000000000004226, schema version: 16
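Schema version 16 still exists right after the restore (see LogAfterMerge above).
However, in the packings loaded from the tablet metadata, schema version 16 no longer exists: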
yb-tserver.ip-172-151-18-254.us-west-2.compute.internal.yugabyte.log.INFO.20240227-133502.283758.gz:I0227 13:35:02.566885 283861 doc_read_context.cc:74] TBL 0000401b000030008000000000004226 T 00c52216ed3341aab818c29f44ec7f4c P d79ad410277a49fa85364466087c2ea2: LogAfterLoad: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
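One way to read the log lines above is as a toy model: filtering by `min_schema_version: 21` drops packings 0 through 20, so a later lookup for version 16 has nothing to find. The sketch below only mirrors the numbers in the logs; it is not YugabyteDB's implementation, and `copy_and_filter` and `find_packing` are hypothetical names borrowed from the log text.

```python
def copy_and_filter(versions, min_schema_version):
    """Keep only the schema versions at or above the minimum, as the log shows."""
    return [v for v in versions if v >= min_schema_version]


def find_packing(known_versions, table_id, version):
    """Look up a packing; fail the way the compaction error message reads."""
    if version not in known_versions:
        raise LookupError(
            f"Cannot find packing for table: {table_id}, schema version: {version}"
        )
    return version


table_id = "0000401b000030008000000000004226"
all_versions = list(range(34))            # [0 .. 33], as in "copy and filter"
kept = copy_and_filter(all_versions, 21)  # versions 21 through 33 remain
print(kept)

try:
    # Stands in for a row still packed with schema version 16.
    find_packing(kept, table_id, 16)
except LookupError as err:
    print(err)
```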
Using this issue to track the disabling of Packed Row + Colocation. https://github.com/yugabyte/yugabyte-db/issues/21244 will track the real fix.