
[Xcluster][Packed Columns][Upgrade] FATAL files, EOF error observed on attempting xcluster replication

Open kripasreenivasan opened this issue 3 years ago • 2 comments

Jira Link: DB-3577

Description

Note: This issue is reproducible when only YSQL packed columns are enabled on master and tserver.

  • Create a 2.12.2.0-b34 source and target universe with packed columns enabled in YSQL (master and tserver) and YCQL (tserver only)
  • Set up identical schema, database and table in the source and target.
  • Set up bidirectional replication
  • Insert data in source and target and verify bidirectional replication
  • Update data in source and target and verify bidirectional replication
  • Upgrade the database to 2.15.3.0-b151
  • Insert data in the source cluster
  • Verify replication in target upon insert

Issue: On verifying replication in the target, the following FATAL files are generated in tserver:

F20220919 13:26:10 ../../src/yb/docdb/primitive_value.cc:1267] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
    @     0x7f0b08f9282c  yb::LogFatalHandlerSink::send()
    @     0x7f0b08390fde  google::LogMessage::SendToLog()
    @     0x7f0b0838e16a  google::LogMessage::Flush()
    @     0x7f0b08391859  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f0b08f61efc  yb::FatalInvalidEnumValueInternal()
    @     0x7f0b0e188f70  yb::docdb::PrimitiveValue::DecodeFromValue()
    @     0x7f0b121c6a8a  yb::docdb::Value::Decode()
    @     0x7f0b121bbad3  yb::docdb::(anonymous namespace)::DocDbRowData::CurrentRow()
    @     0x7f0b121bde8b  yb::docdb::SubDocumentReader::Get()
    @     0x7f0b1214519a  yb::docdb::DocDBTableReader::Get()
    @     0x7f0b121664e3  yb::docdb::DocRowwiseIterator::HasNext()
    @     0x7f0b1219980a  yb::docdb::PgsqlReadOperation::ExecuteScalar()
    @     0x7f0b1219b96f  yb::docdb::PgsqlReadOperation::Execute()
    @     0x7f0b130d9172  yb::tablet::AbstractTablet::HandlePgsqlReadRequest()
    @     0x7f0b13108c23  yb::tablet::Tablet::HandlePgsqlReadRequest()
    @     0x7f0b13f0f5b6  yb::tserver::TabletServiceImpl::DoReadImpl()
    @     0x7f0b13f106ec  yb::tserver::TabletServiceImpl::DoRead()
    @     0x7f0b13f10aac  yb::tserver::TabletServiceImpl::CompleteRead()
    @     0x7f0b13f12a4d  yb::tserver::TabletServiceImpl::Read()
    @     0x7f0b10239da6  _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_7tserver13ReadRequestPBENS3_14ReadResponsePBEEEZZNS3_21TabletServerServiceIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENKUlSt10shared_ptrINS0_11InboundCallEEE0_clESF_EUlPKS4_PS5_NS0_10RpcContextEE_EEDaSF_T0_
    @     0x7f0b10239f5d  _ZNSt17_Function_handlerIFvSt10shared_ptrIN2yb3rpc11InboundCallEEEZNS1_7tserver21TabletServerServiceIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEEUlS4_E0_E9_M_invokeERKSt9_Any_dataOS4_
    @     0x7f0b1022f304  yb::tserver::TabletServerServiceIf::Handle()
    @     0x7f0b09ee09ae  yb::rpc::ServicePoolImpl::Handle()
    @     0x7f0b09e82b44  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7f0b09ef46f8  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f0b09035c45  yb::Thread::SuperviseThread()
    @     0x7f0b04911694  start_thread
    @     0x7f0b0404e41d  __clone
    @              (nil)  (unknown)
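
The numeric value in the FATAL message can be read as a single byte. Below is a minimal sketch of that decoding. The helper name `describe_value_type` is hypothetical, and reading the resulting character as the packed-row marker is an assumption based on my understanding of DocDB's value-type encoding in releases that support packed rows; the log itself does not say so.

```python
# The FATAL reports an unrecognized ValueType with numeric value 122.
# Assumption: DocDB value types are one-byte codes, so show the byte as a character.
UNKNOWN_VALUE_TYPE = 122


def describe_value_type(code: int) -> str:
    """Return a human-readable form of a one-byte value-type code."""
    if not 0 <= code <= 255:
        raise ValueError(f"value type must fit in one byte, got {code}")
    char = chr(code)
    shown = repr(char) if char.isprintable() else "non-printable"
    return f"{code} (0x{code:02x}, {shown})"


print(describe_value_type(UNKNOWN_VALUE_TYPE))  # prints: 122 (0x7a, 'z')
```

If the assumption holds, a reader that does not recognize `'z'` as a value type would be consistent with the observation in this issue that the failure only appears with packed columns enabled.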

CC: @lingamsandeep

kripasreenivasan avatar Sep 20 '22 09:09 kripasreenivasan

https://phabricator.dev.yugabyte.com/D16927 is not present in 2.12.2.0-b34. We will use a 2.12 version with this fix and retry this test case to see if this FATAL is still observed.

kripasreenivasan avatar Sep 20 '22 14:09 kripasreenivasan

This was re-executed with 2.12.9.0-b3 as the seed version (the version that contains the fix, https://phabricator.dev.yugabyte.com/D16927), and we are still consistently running into this issue. This needs to be investigated. CC: @def- , @rthallamko3 , @lingamsandeep

kripasreenivasan avatar Sep 22 '22 12:09 kripasreenivasan

@kripasreenivasan have we tried running the same test without xcluster? It seems like the crash is on the master, on the target, but that's strange, as the master shouldn't be involved in xcluster at runtime...

So I wonder if we could repro the same issue, just in a single cluster setting, but maybe forcing the master to write some more data (eg: running a DDL / restarting a tserver, to write some heartbeats) and then forcing a master compaction, so it has to pack some rows on disk.

Also, the error shows up on a new YSQL connection, so perhaps you just never connected to the source, but only the target, that's why only the target crashed...

bmatican avatar Oct 12 '22 14:10 bmatican

Without xcluster this looks like a simple upgrade test, which we are running and haven't seen this in. The FATAL seems to be on target side, not source, so without xcluster I wouldn't expect to see it.

def- avatar Oct 21 '22 11:10 def-

@lingamsandeep , Would your fix resolve this issue as well?

rthallamko3 avatar Oct 24 '22 21:10 rthallamko3

@rthallamko3 Possibly. @def- , @kripasreenivasan - Is it possible to repro this issue again and check what the schema_version of the tables is on both the source and target universe? If they are different and we are using packed rows, then yes, my fix should address it.
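
The check suggested above can be sketched as a comparison of per-table schema versions collected from each universe. This is only an illustration: how the versions are obtained is not covered in this thread, so the input dictionaries, the function name `schema_version_mismatches`, and the sample values are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[int], Optional[int]]


def schema_version_mismatches(
    source: Dict[str, int], target: Dict[str, int]
) -> List[Mismatch]:
    """Return (table, source_version, target_version) for every table whose
    schema_version differs between universes. A table missing on one side
    is reported with None for that side."""
    mismatches: List[Mismatch] = []
    for table in sorted(set(source) | set(target)):
        src_version = source.get(table)
        tgt_version = target.get(table)
        if src_version != tgt_version:
            mismatches.append((table, src_version, tgt_version))
    return mismatches


# Hypothetical versions, gathered by hand from each universe.
source_versions = {"t_01": 3}
target_versions = {"t_01": 1}
print(schema_version_mismatches(source_versions, target_versions))
```

An empty result would mean the versions agree, in which case the schema-version fix mentioned above would not be expected to explain the FATAL.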

lingamsandeep avatar Oct 24 '22 22:10 lingamsandeep

@kripasreenivasan , Can we retest this on latest master build? We want to understand if there is any other issue beyond the one that Sandeep fixed.

rthallamko3 avatar Oct 29 '22 00:10 rthallamko3

Executed the above scenario with only YSQL packed columns enabled and could not reproduce it after @lingamsandeep 's fix.

kripasreenivasan avatar Nov 28 '22 09:11 kripasreenivasan

Looks like it is taken care of.

rthallamko3 avatar Nov 28 '22 13:11 rthallamko3

This issue is no longer observed even in the case of YCQL packed columns.

kripasreenivasan avatar Nov 29 '22 06:11 kripasreenivasan

Deterministic repro available with automation script testxclusterwithupgrade-aws-rf3-upgrade-2.12.10.0-b41

kripasreenivasan avatar Dec 14 '22 12:12 kripasreenivasan

@spolitov , This scenario fails with PackedRow feature enabled, but doesn't repro with PackedRow disabled. Do you think the underlying failure - Invalid enum on the doc reader could be a core packed row issue or bi-directional xcluster related?


F20221215 16:53:38 ../../src/yb/docdb/primitive_value.cc:1266] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
    @     0x7f2204b3225c  yb::LogFatalHandlerSink::send()
    @     0x7f2203f33fde  google::LogMessage::SendToLog()
    @     0x7f2203f3116a  google::LogMessage::Flush()
    @     0x7f2203f34859  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2204b0190c  yb::FatalInvalidEnumValueInternal()
    @     0x7f2209d70d90  yb::docdb::PrimitiveValue::DecodeFromValue()
    @     0x7f220ddd964a  yb::docdb::Value::Decode()
    @     0x7f220ddce693  yb::docdb::(anonymous namespace)::DocDbRowData::CurrentRow()
    @     0x7f220ddd0a4b  yb::docdb::SubDocumentReader::Get()
    @     0x7f220dd5771a  yb::docdb::DocDBTableReader::Get()
    @     0x7f220dd78c63  yb::docdb::DocRowwiseIterator::HasNext()
    @     0x7f220ddabaea  yb::docdb::PgsqlReadOperation::ExecuteScalar()
    @     0x7f220ddade1f  yb::docdb::PgsqlReadOperation::Execute()
    @     0x7f220ecfdb92  yb::tablet::AbstractTablet::HandlePgsqlReadRequest()
    @     0x7f220ed2d233  yb::tablet::Tablet::HandlePgsqlReadRequest()
    @     0x7f220fbb2b06  yb::tserver::TabletServiceImpl::DoReadImpl()
    @     0x7f220fbb3c3c  yb::tserver::TabletServiceImpl::DoRead()
    @     0x7f220fbb3ffc  yb::tserver::TabletServiceImpl::CompleteRead()
    @     0x7f220fbb641f  yb::tserver::TabletServiceImpl::Read()
   

rthallamko3 avatar Mar 06 '23 16:03 rthallamko3

@kripasreenivasan , Does this still repro? I am wondering if the schema change related fixes that were made in Jan have taken care of this.

rthallamko3 avatar Mar 08 '23 02:03 rthallamko3

Notes from @kripasreenivasan ,

  • This is a YSQL packed columns only issue, i.e. the same test with packed columns disabled passes.
  • This is the test run of master-b63: http://release.dev.yugabyte.com/tests/18375/log. The ticket has the test name testxclusterwithupgradeandpackedcolumns. If you search for this name in this link you will find the test result and the fact that it is still failing at the point mentioned in the bug.
  • This testcase continues to fail with this bug even with the latest available master build, b91.

rthallamko3 avatar Mar 15 '23 19:03 rthallamko3

@lingamsandeep has a diff for colocated + YSQL packed + xcluster: https://phabricator.dev.yugabyte.com/D23487, but my understanding was that for non-colocated, we should be good right now, so this is a bit strange.

bmatican avatar Mar 15 '23 19:03 bmatican

Confirming that the automation test case uses a non-colocated schema.

kripasreenivasan avatar Mar 16 '23 04:03 kripasreenivasan

Attempted this on a unidirectional replication scenario and I was able to reproduce the bug. On verifying replication in the target after the insert:

cl_db_001=# select count(*) from t_01 ;
FATAL:  terminating connection due to administrator command
ERROR:  Aborted: Failed to schedule: 0x0000000002559810 -> Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none])

kripasreenivasan avatar Mar 22 '23 08:03 kripasreenivasan

Summarizing new steps attempted:

  • Create a 2.12.10.0-b41 source and target universe
  • Set up identical schema, database and table in the source and target.
create database cl_db_001;
\c cl_db_001
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TYPE complex AS (re float8, im float8);
CREATE TYPE e_details AS ENUM ('Email', 'Sms', 'Phone');
CREATE DOMAIN postal_code AS TEXT CHECK(VALUE ~ '^\d{5}$' OR VALUE ~ '^\d{5}-\d{4}$');

CREATE TABLE t_01 (id TEXT PRIMARY KEY, uuid_col uuid DEFAULT uuid_generate_v4(), name text COLLATE "en_US.utf8", c complex, info json, contact JSONB, arr smallint[], cash money, i inet, m macaddr, val smallint, age int, collated_data text collate "POSIX",date DATE, n NUMERIC (3, 2),r real, c1 CHAR(1), created_at timestamptz, uuid0 uuid DEFAULT uuid_nil(), uuid1 uuid DEFAULT uuid_generate_v1(), p1 POINT,t1 TIME,ts1 TIMESTAMP,i4 INTERVAL,p2 path, p3 polygon,b box, c2 circle, l line, l1 lseg, a2 text[][], zip postal_code, fk_id text, details e_details);
  • Set up unidirectional replication from source to target
bin/yb-admin --master_addresses 172.151.31.238,172.151.46.127,172.151.62.188 setup_universe_replication 905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01 172.151.30.76,172.151.40.132,172.151.55.223 0000400000003000800000000000401a
  • Insert data in source and target and verify replication
DO
$do$
BEGIN 
   FOR i IN 1..60000 LOOP
      EXECUTE format('INSERT INTO t_01(id, name, c, info, contact, arr, cash, i, m,val,age,collated_data, date,n,r,c1,created_at,p1,t1,ts1,i4,p2,p3,c2,l,l1,a2,zip,details) VALUES (%s,''Snowy'',(1,2), ''{"Snowy": "John Doe", "items": {"product": "Laptop","qty": 6}}'',''{"Snowy":[ {"type": "mobile", "phone": "001001"}, {"type": "fix", "phone": "002002"}]}'', ''{1, 2, 3}'', ''$99.99'' ,''1.1.1.1'', ''00:00:00:00:00:00'',666,80000, md5(random()::text), current_timestamp,5.36,2147483647,''G'',now(), point(3, 4),LOCALTIME(0), current_timestamp(0),  ''00:25:00'', ''(1,3), (4,12)'',''(1,3),(4,12), (2,4)'',''10, 4, 10'',''{1,2,3}'',''(0, 4), (2,8)'',''{{"b1", "c"}, {"m1", "l"}}'',89000,''Sms'');',i);
   END LOOP;
END
$do$;
  • Upgrade source universe to 2.17.3.0-b55 from the platform UI.
  • Delete the replication
bin/yb-admin --master_addresses 172.151.31.238,172.151.46.127,172.151.62.188 delete_universe_replication 905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01
  • Turn on YSQL packed storage in the source universe on master and tserver
  • Set up unidirectional replication from source to target
bin/yb-admin --master_addresses 172.151.31.238,172.151.46.127,172.151.62.188 setup_universe_replication 905bc1b0-d497-4a7a-a54b-872ae01beec1_src_to_target_01 172.151.30.76,172.151.40.132,172.151.55.223 0000400000003000800000000000401a
  • Insert data in the table
DO
$do$
BEGIN 
   FOR i IN 60001..90000 LOOP
      EXECUTE format('INSERT INTO t_01(id, name, c, info, contact, arr, cash, i, m,val,age,collated_data, date,n,r,c1,created_at,p1,t1,ts1,i4,p2,p3,c2,l,l1,a2,zip,details) VALUES (%s,''Snowy'',(1,2), ''{"Snowy": "John Doe", "items": {"product": "Laptop","qty": 6}}'',''{"Snowy":[ {"type": "mobile", "phone": "001001"}, {"type": "fix", "phone": "002002"}]}'', ''{1, 2, 3}'', ''$99.99'' ,''1.1.1.1'', ''00:00:00:00:00:00'',666,80000, md5(random()::text), current_timestamp,5.36,2147483647,''G'',now(), point(3, 4),LOCALTIME(0), current_timestamp(0),  ''00:25:00'', ''(1,3), (4,12)'',''(1,3),(4,12), (2,4)'',''10, 4, 10'',''{1,2,3}'',''(0, 4), (2,8)'',''{{"b1", "c"}, {"m1", "l"}}'',89000,''Sms'');',i);
   END LOOP;
END
$do$;
  • Execute a select on the target universe table
db_01=# select count(*) from t_01;
FATAL:  terminating connection due to administrator command
ERROR:  Aborted: Failed to schedule: 0x0000000002a80010 -> Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none])
select * from t_01 limit 1;
FATAL:  terminating connection due to administrator command
ERROR:  Aborted: Failed to schedule: 0x00000000033ddb10 -> Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none])
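
For the verification steps above, the expected row counts follow from the two insert loops, and the replication check can be sketched as polling the target until it reaches that count. This is a rough sketch, not the automation used for this issue: `wait_for_row_count` and the `read_count` callable are hypothetical, and the callable stands in for whatever runs `select count(*) from t_01` on the target.

```python
import time
from typing import Callable

# Row counts implied by the two DO blocks in the steps above.
FIRST_BATCH = len(range(1, 60001))        # ids 1..60000
SECOND_BATCH = len(range(60001, 90001))   # ids 60001..90000
EXPECTED_TOTAL = FIRST_BATCH + SECOND_BATCH


def wait_for_row_count(
    read_count: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> bool:
    """Poll read_count() until it returns `expected` or the timeout elapses.

    Returns True if the expected count was seen, False on timeout.
    Exceptions raised by read_count() propagate to the caller."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_count() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Stand-in for a real query against the target universe.
fake_counts = iter([60000, 75000, 90000])
print(wait_for_row_count(lambda: next(fake_counts), EXPECTED_TOTAL, poll_s=0))
```

In the failing runs reported here the count query itself is terminated on the target, so a real `read_count` would raise rather than return a lower count.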

CC: @lingamsandeep , @rthallamko3

Target tserver has the following FATAL:

F0322 10:23:13.514407 20987 primitive_value.cc:1266] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.
F0322 10:23:13.514407 20994 primitive_value.cc:1266] Invalid value of enum ValueType (full enum type: yb::docdb::ValueType, expression: value_type): 122.

Logs: yb-support-bundle-ksreenivasan-target-14083-20230322102737.689-logs.tar.gz yb-support-bundle-ksreenivasan-src-14083-20230322102727.547-logs.tar.gz

kripasreenivasan avatar Mar 22 '23 10:03 kripasreenivasan

For XCluster + Packed columns, support was only added in 2.16. The recommended sequence for upgrading from 2.12 -> 2.16+ is:

  1. Delete any existing replication between source and target
  2. Upgrade both source and target universes to 2.16+
  3. Turn ON packed on both source and target
  4. Re-enable replication.

Given that this is an unsupported scenario, resolving this issue as by-design.
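
The recommended sequence above can be sketched as an ordered plan. The two yb-admin invocations mirror the `delete_universe_replication` and `setup_universe_replication` commands used earlier in this thread; the upgrade and packed-columns steps are left as manual placeholders because the thread gives no commands for them. The function names, host names, replication id, and table ids below are hypothetical.

```python
import shlex
from typing import List, Optional, Tuple

YB_ADMIN = "bin/yb-admin"  # path as used in the commands earlier in this thread

Step = Tuple[str, Optional[List[str]]]


def delete_replication_cmd(target_masters: str, replication_id: str) -> List[str]:
    """argv for removing an existing replication, run against the target masters."""
    return [YB_ADMIN, "--master_addresses", target_masters,
            "delete_universe_replication", replication_id]


def setup_replication_cmd(target_masters: str, replication_id: str,
                          source_masters: str, table_ids: List[str]) -> List[str]:
    """argv for (re)creating replication from source to target."""
    return [YB_ADMIN, "--master_addresses", target_masters,
            "setup_universe_replication", replication_id,
            source_masters, ",".join(table_ids)]


def upgrade_plan(target_masters: str, source_masters: str,
                 replication_id: str, table_ids: List[str]) -> List[Step]:
    """Ordered steps; a step with no argv is manual (command not given in the thread)."""
    return [
        ("Delete existing replication",
         delete_replication_cmd(target_masters, replication_id)),
        ("Upgrade both source and target universes to 2.16+", None),
        ("Turn on packed columns on both source and target", None),
        ("Re-enable replication",
         setup_replication_cmd(target_masters, replication_id,
                               source_masters, table_ids)),
    ]


# Hypothetical addresses and ids, for illustration only.
for title, argv in upgrade_plan("target-m1,target-m2,target-m3",
                                "source-m1,source-m2,source-m3",
                                "example_src_to_target", ["table_id_1"]):
    print(f"{title}: {shlex.join(argv) if argv else '(manual step)'}")
```

The plan only builds and prints the commands; nothing is executed, so it can be reviewed before being run against real universes.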

lingamsandeep avatar Mar 28 '23 16:03 lingamsandeep