yugabyte-db
[Xcluster][Packed Columns][Upgrade] FATAL files, EOF error observed on attempting xcluster replication
Jira Link: DB-3577
Description
Note: This issue is reproducible when only YSQL packed columns are enabled on master and tserver!
- Create a 2.12.2.0-b34 source and target universe with packed columns enabled in YSQL (master and tserver) and YCQL (tserver only)
- Set up identical schema, database and table in the source and target.
- Set up bidirectional replication
- Insert data in source and target and verify bidirectional replication
- Update data in source and target and verify bidirectional replication
- Upgrade the database to 2.15.3.0-b151
- Insert data in the source cluster
- Verify replication in target upon insert
Issue: On verifying replication in the target, the following FATAL files are generated in tserver:
F20220919 13:26:10 ../../src/yb/docdb/primitive_value.cc:1267] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
@ 0x7f0b08f9282c yb::LogFatalHandlerSink::send()
@ 0x7f0b08390fde google::LogMessage::SendToLog()
@ 0x7f0b0838e16a google::LogMessage::Flush()
@ 0x7f0b08391859 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f0b08f61efc yb::FatalInvalidEnumValueInternal()
@ 0x7f0b0e188f70 yb::docdb::PrimitiveValue::DecodeFromValue()
@ 0x7f0b121c6a8a yb::docdb::Value::Decode()
@ 0x7f0b121bbad3 yb::docdb::(anonymous namespace)::DocDbRowData::CurrentRow()
@ 0x7f0b121bde8b yb::docdb::SubDocumentReader::Get()
@ 0x7f0b1214519a yb::docdb::DocDBTableReader::Get()
@ 0x7f0b121664e3 yb::docdb::DocRowwiseIterator::HasNext()
@ 0x7f0b1219980a yb::docdb::PgsqlReadOperation::ExecuteScalar()
@ 0x7f0b1219b96f yb::docdb::PgsqlReadOperation::Execute()
@ 0x7f0b130d9172 yb::tablet::AbstractTablet::HandlePgsqlReadRequest()
@ 0x7f0b13108c23 yb::tablet::Tablet::HandlePgsqlReadRequest()
@ 0x7f0b13f0f5b6 yb::tserver::TabletServiceImpl::DoReadImpl()
@ 0x7f0b13f106ec yb::tserver::TabletServiceImpl::DoRead()
@ 0x7f0b13f10aac yb::tserver::TabletServiceImpl::CompleteRead()
@ 0x7f0b13f12a4d yb::tserver::TabletServiceImpl::Read()
@ 0x7f0b10239da6 _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_7tserver13ReadRequestPBENS3_14ReadResponsePBEEEZZNS3_21TabletServerServiceIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENKUlSt10shared_ptrINS0_11InboundCallEEE0_clESF_EUlPKS4_PS5_NS0_10RpcContextEE_EEDaSF_T0_
@ 0x7f0b10239f5d _ZNSt17_Function_handlerIFvSt10shared_ptrIN2yb3rpc11InboundCallEEEZNS1_7tserver21TabletServerServiceIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEEUlS4_E0_E9_M_invokeERKSt9_Any_dataOS4_
@ 0x7f0b1022f304 yb::tserver::TabletServerServiceIf::Handle()
@ 0x7f0b09ee09ae yb::rpc::ServicePoolImpl::Handle()
@ 0x7f0b09e82b44 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x7f0b09ef46f8 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x7f0b09035c45 yb::Thread::SuperviseThread()
@ 0x7f0b04911694 start_thread
@ 0x7f0b0404e41d __clone
@ (nil) (unknown)
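When triaging several tserver logs, it can help to pull the offending enum value out of each FATAL line mechanically. The following is a minimal Python sketch, not part of any YugabyteDB tooling; the function name `parse_invalid_enum` and the regular expression are my own, written against the two log line shapes quoted in this issue. It also reports the value as a printable character: 122 is ASCII `z`, which may hint at a value tag the older reader does not recognize, although this thread does not confirm that mapping.

```python
import re

# Pattern written against the FATAL lines quoted in this issue (an assumption,
# not an official log grammar): "F<date> <time> [thread] <file>.cc:<line>] ..."
INVALID_ENUM_RE = re.compile(
    r"^F(?P<stamp>\d{4,8} [\d:.]+).*?"
    r"(?P<file>[\w./]+\.cc):(?P<line>\d+)\] "
    r"Invalid value of enum (?P<enum>\w+) .*: (?P<value>\d+)\.\s*$"
)


def parse_invalid_enum(line):
    """Return a dict describing the FATAL line, or None if it does not match."""
    m = INVALID_ENUM_RE.match(line.strip())
    if m is None:
        return None
    value = int(m.group("value"))
    return {
        # Keep only the file name so both path styles compare equal.
        "source_file": m.group("file").rsplit("/", 1)[-1],
        "source_line": int(m.group("line")),
        "enum": m.group("enum"),
        "value": value,
        # Printable ASCII only; anything else is reported as None.
        "as_char": chr(value) if 32 <= value < 127 else None,
    }
```

Feeding it the FATAL line above yields the enum name `ValueType`, the value 122, and the character `z`; stack frame lines starting with `@` return `None` and can be skipped.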
CC: @lingamsandeep
https://phabricator.dev.yugabyte.com/D16927 is not present in 2.12.2.0-b34. We will use a 2.12 version with this fix and retry this test case to see if this FATAL is still observed.
This was re-executed with 2.12.9.0-b3 as the seed version (the version that contains the fix, https://phabricator.dev.yugabyte.com/D16927), and yet we are consistently running into this issue. This needs to be investigated. CC: @def- , @rthallamko3 , @lingamsandeep
@kripasreenivasan Have we tried running the same test without xcluster? It seems like the crash is on the master, on the target, but that's strange, as the master shouldn't be involved in xcluster at runtime...
So I wonder if we could repro the same issue, just in a single cluster setting, but maybe forcing the master to write some more data (eg: running a DDL / restarting a tserver, to write some heartbeats) and then forcing a master compaction, so it has to pack some rows on disk.
Also, the error shows up on a new YSQL connection, so perhaps you just never connected to the source, but only the target, that's why only the target crashed...
Without xcluster this looks like a simple upgrade test, which we are running and haven't seen this in. The FATAL seems to be on target side, not source, so without xcluster I wouldn't expect to see it.
@lingamsandeep , Would your fix resolve this issue as well?
@rthallamko3 Possibly. @def- , @kripasreenivasan - Is it possible to repro this issue again and check what the schema_version of the tables are on both the source and target universe. If they are different and we are using packed rows, then yes, my fix should address it.
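The check requested above, comparing the schema_version of each table on the source and target universes, can be sketched as follows. This is a hypothetical helper, not an existing tool: it assumes the versions have already been collected by hand into plain `{table_name: schema_version}` dictionaries, and the function name `schema_version_mismatches` is made up for illustration.

```python
def schema_version_mismatches(source, target):
    """Compare {table_name: schema_version} maps and return the tables that differ.

    The result maps each differing table to a (source_version, target_version)
    tuple. A table present on only one side gets None for the missing side.
    """
    mismatches = {}
    for table in sorted(set(source) | set(target)):
        src, tgt = source.get(table), target.get(table)
        if src != tgt:
            mismatches[table] = (src, tgt)
    return mismatches


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from this issue.
    print(schema_version_mismatches({"t_01": 3}, {"t_01": 0}))
```

An empty result would suggest the schema versions line up, in which case the schema-version fix discussed here would be less likely to explain the FATAL.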
@kripasreenivasan , Can we retest this on latest master build? We want to understand if there is any other issue beyond the one that Sandeep fixed.
Executed the above scenario with YSQL packed columns only enabled and could not reproduce it post @lingamsandeep 's fix.
Looks like it is taken care of.
This issue is no longer observed even in the case of YCQL packed columns.
Deterministic repro available with automation script testxclusterwithupgrade-aws-rf3-upgrade-2.12.10.0-b41
@spolitov , This scenario fails with PackedRow feature enabled, but doesn't repro with PackedRow disabled. Do you think the underlying failure - Invalid enum on the doc reader could be a core packed row issue or bi-directional xcluster related?
F20221215 16:53:38 ../../src/yb/docdb/primitive_value.cc:1266] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
@ 0x7f2204b3225c yb::LogFatalHandlerSink::send()
@ 0x7f2203f33fde google::LogMessage::SendToLog()
@ 0x7f2203f3116a google::LogMessage::Flush()
@ 0x7f2203f34859 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2204b0190c yb::FatalInvalidEnumValueInternal()
@ 0x7f2209d70d90 yb::docdb::PrimitiveValue::DecodeFromValue()
@ 0x7f220ddd964a yb::docdb::Value::Decode()
@ 0x7f220ddce693 yb::docdb::(anonymous namespace)::DocDbRowData::CurrentRow()
@ 0x7f220ddd0a4b yb::docdb::SubDocumentReader::Get()
@ 0x7f220dd5771a yb::docdb::DocDBTableReader::Get()
@ 0x7f220dd78c63 yb::docdb::DocRowwiseIterator::HasNext()
@ 0x7f220ddabaea yb::docdb::PgsqlReadOperation::ExecuteScalar()
@ 0x7f220ddade1f yb::docdb::PgsqlReadOperation::Execute()
@ 0x7f220ecfdb92 yb::tablet::AbstractTablet::HandlePgsqlReadRequest()
@ 0x7f220ed2d233 yb::tablet::Tablet::HandlePgsqlReadRequest()
@ 0x7f220fbb2b06 yb::tserver::TabletServiceImpl::DoReadImpl()
@ 0x7f220fbb3c3c yb::tserver::TabletServiceImpl::DoRead()
@ 0x7f220fbb3ffc yb::tserver::TabletServiceImpl::CompleteRead()
@ 0x7f220fbb641f yb::tserver::TabletServiceImpl::Read()
@kripasreenivasan , Does this still repro? I am wondering if the schema change related fixes that were made in Jan have taken care of this.
Notes from @kripasreenivasan ,
- This is a YSQL packed columns only issue, i.e. the same test with packed columns disabled passes.
- This is the test run of master-b63: http://release.dev.yugabyte.com/tests/18375/log. The ticket has the test name testxclusterwithupgradeandpackedcolumns. If you search for this name in this link you will find the test result and the fact that it is still failing at the point mentioned in the bug.
- This testcase continues to fail with this bug even with the latest available master build, b91.
@lingamsandeep has a diff for colocated + YSQL packed + xcluster: https://phabricator.dev.yugabyte.com/D23487, but my understanding was that for non-colocated, we should be good right now, so this is a bit strange.
Confirming that the automation test case uses a non-colocated schema.
Attempted this in a unidirectional replication scenario and was able to reproduce the bug. On verifying replication in the target after the insert:
cl_db_001=# select count(*) from t_01 ;
FATAL: terminating connection due to administrator command
ERROR: Aborted: Failed to schedule: 0x0000000002559810 -> Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none])
Summarizing new steps attempted:
- Create a 2.12.10.0-b41 source and target universe
- Set up identical schema, database and table in the source and target. Setup unidirectional replication from source to target
create database cl_db_001;
\c cl_db_001
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TYPE complex AS (re float8, im float8);
CREATE TYPE e_details AS ENUM ('Email', 'Sms', 'Phone');
CREATE DOMAIN postal_code AS TEXT CHECK(VALUE ~ '^\d{5}$'OR VALUE ~ '^\d{5}-\d{4}$');
CREATE TABLE t_01 (id TEXT PRIMARY KEY, uuid_col uuid DEFAULT uuid_generate_v4(), name text COLLATE "en_US.utf8", c complex, info json, contact JSONB, arr smallint[], cash money, i inet, m macaddr, val smallint, age int, collated_data text collate "POSIX",date DATE, n NUMERIC (3, 2),r real, c1 CHAR(1), created_at timestamptz, uuid0 uuid DEFAULT uuid_nil(), uuid1 uuid DEFAULT uuid_generate_v1(), p1 POINT,t1 TIME,ts1 TIMESTAMP,i4 INTERVAL,p2 path, p3 polygon,b box, c2 circle, l line, l1 lseg, a2 text[][], zip postal_code, fk_id text, details e_details);
Setup unidirectional replication from Src to Target
bin/yb-admin --master_addresses 172.151.31.238,172.151.46.127,172.151.62.188 setup_universe_replication 905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01 172.151.30.76,172.151.40.132,172.151.55.223 0000400000003000800000000000401a
- Insert data in source and target and verify replication
DO
$do$
BEGIN
FOR i IN 1..60000 LOOP
EXECUTE format('INSERT INTO t_01(id, name, c, info, contact, arr, cash, i, m,val,age,collated_data, date,n,r,c1,created_at,p1,t1,ts1,i4,p2,p3,c2,l,l1,a2,zip,details) VALUES (%s,''Snowy'',(1,2), ''{"Snowy": "John Doe", "items": {"product": "Laptop","qty": 6}}'',''{"Snowy":[ {"type": "mobile", "phone": "001001"}, {"type": "fix", "phone": "002002"}]}'', ''{1, 2, 3}'', ''$99.99'' ,''1.1.1.1'', ''00:00:00:00:00:00'',666,80000, md5(random()::text), current_timestamp,5.36,2147483647,''G'',now(), point(3, 4),LOCALTIME(0), current_timestamp(0), ''00:25:00'', ''(1,3), (4,12)'',''(1,3),(4,12), (2,4)'',''10, 4, 10'',''{1,2,3}'',''(0, 4), (2,8)'',''{{"b1", "c"}, {"m1", "l"}}'',89000,''Sms'');',i);
END LOOP;
END
$do$;
- Upgrade source universe to 2.17.3.0-b55 from the platform UI.
- Delete the replication
bin/yb-admin --master_addresses 172.151.31.238,172.151.46.127,172.151.62.188 delete_universe_replication 905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01
- Turned on YSQL packed storage in the source universe in master and tserver
- Setup unidirectional replication from Src to Target
bin/yb-admin --master_addresses 172.151.31.238,172.151.46.127,172.151.62.188 setup_universe_replication 905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01 172.151.30.76,172.151.40.132,172.151.55.223 0000400000003000800000000000401a
- Insert data in the table
DO
$do$
BEGIN
FOR i IN 60001..90000 LOOP
EXECUTE format('INSERT INTO t_01(id, name, c, info, contact, arr, cash, i, m,val,age,collated_data, date,n,r,c1,created_at,p1,t1,ts1,i4,p2,p3,c2,l,l1,a2,zip,details) VALUES (%s,''Snowy'',(1,2), ''{"Snowy": "John Doe", "items": {"product": "Laptop","qty": 6}}'',''{"Snowy":[ {"type": "mobile", "phone": "001001"}, {"type": "fix", "phone": "002002"}]}'', ''{1, 2, 3}'', ''$99.99'' ,''1.1.1.1'', ''00:00:00:00:00:00'',666,80000, md5(random()::text), current_timestamp,5.36,2147483647,''G'',now(), point(3, 4),LOCALTIME(0), current_timestamp(0), ''00:25:00'', ''(1,3), (4,12)'',''(1,3),(4,12), (2,4)'',''10, 4, 10'',''{1,2,3}'',''(0, 4), (2,8)'',''{{"b1", "c"}, {"m1", "l"}}'',89000,''Sms'');',i);
END LOOP;
END
$do$;
- Executed a select on the target universe table
db_01=# select count(*) from t_01;
FATAL: terminating connection due to administrator command
ERROR: Aborted: Failed to schedule: 0x0000000002a80010 -> Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none])
select * from t_01 limit 1;
FATAL: terminating connection due to administrator command
ERROR: Aborted: Failed to schedule: 0x00000000033ddb10 -> Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none])
CC: @lingamsandeep , @rthallamko3
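Since the steps above run the same yb-admin commands more than once, a small builder keeps the arguments consistent between the setup and delete calls. This is a sketch only: the helper names are hypothetical, and it assumes (as I understand yb-admin's convention, which the thread does not spell out) that `--master_addresses` names the target masters and the positional address list names the source masters. It only builds argument lists and does not run anything.

```python
def yb_admin_cmd(master_addresses, subcommand, *args, binary="bin/yb-admin"):
    """Build an argv list in the shape of the commands quoted in this thread."""
    return [binary, "--master_addresses", ",".join(master_addresses), subcommand, *args]


def setup_replication_cmd(target_masters, replication_id, source_masters, table_ids):
    """argv for setup_universe_replication (argument order copied from the thread)."""
    return yb_admin_cmd(
        target_masters,
        "setup_universe_replication",
        replication_id,
        ",".join(source_masters),
        ",".join(table_ids),
    )


def delete_replication_cmd(target_masters, replication_id):
    """argv for delete_universe_replication."""
    return yb_admin_cmd(target_masters, "delete_universe_replication", replication_id)


if __name__ == "__main__":
    # Values copied from the commands quoted above.
    masters = ["172.151.31.238", "172.151.46.127", "172.151.62.188"]
    rep_id = "905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01"
    print(" ".join(delete_replication_cmd(masters, rep_id)))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.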
Target tserver has the following FATAL
F0322 10:23:13.514407 20987 primitive_value.cc:1266] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
F0322 10:23:13.514407 20994 primitive_value.cc:1266] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
Logs: yb-support-bundle-ksreenivasan-target-14083-20230322102737.689-logs.tar.gz yb-support-bundle-ksreenivasan-src-14083-20230322102727.547-logs.tar.gz
For XCluster + Packed columns, support was only added in 2.16. Given that, the recommended sequence for upgrading from 2.12 -> 2.16+ is:
- Delete any existing replication between source and target
- Upgrade both source and target universes to 2.16+
- Turn ON packed on both source and target
- Re-enable replication.
Given that this is an unsupported scenario, resolving this issue as by-design.
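The recommended sequence above can be expressed as a precondition check to run before re-enabling replication. The following Python sketch is one reading of those steps, not an official validation: the function names are hypothetical, the 2.16 threshold comes from the statement that support was only added in 2.16, and treating a packed setting that differs between universes as a blocker is my interpretation of "turn ON packed on both".

```python
import re

# "support was only added in 2.16"
MIN_PACKED_XCLUSTER = (2, 16, 0, 0)


def parse_version(version):
    """Turn a string like '2.17.3.0-b55' into (2, 17, 3, 0); the build suffix is ignored."""
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)\.(\d+)(?:-b\d+)?$", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in m.groups())


def packed_xcluster_blockers(source_version, target_version, source_packed, target_packed):
    """List reasons the recommended preconditions are not yet met (empty list if none)."""
    blockers = []
    for name, version in (("source", source_version), ("target", target_version)):
        if parse_version(version) < MIN_PACKED_XCLUSTER:
            blockers.append(f"{name} universe is on {version}, below 2.16")
    if source_packed != target_packed:
        blockers.append("packed columns setting differs between source and target")
    return blockers


if __name__ == "__main__":
    # My reading of the unidirectional repro: only the source was upgraded
    # and had YSQL packed storage turned on.
    for reason in packed_xcluster_blockers("2.17.3.0-b55", "2.12.10.0-b41", True, False):
        print(reason)
```

For that repro the check reports both a version blocker on the target and a packed-setting mismatch, which matches the conclusion that the scenario falls outside the supported upgrade path.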