relay didn't get latest "min location from subtasks", and when continue parsing heartbeat logic is wrong
Bug Report
Please answer these questions before submitting your issue. Thanks!
- What did you do? If possible, provide a recipe for reproducing the error.
two problems from https://github.com/pingcap/dm/pull/1518/checks?check_run_id=2165916757
1.1 when resuming relay from detecting upstream switching, relay didn't get the chance to run the code of "get min location from subtasks", so it will pull very old binlogs (introduced by https://github.com/pingcap/dm/pull/1507)
1.2 when we continue parsing relay log files because of needReParse, we may continue from a location between "GTIDEvent" and "XIDEvent", so replaceWithHeartbeat is wrongly calculated
for 1.1, seems the failure is caused by multiple mysql instance hiding from a same IP/source-id, so there maybe mixed server_uuid/ServerID for a relay unit.
[2021/03/22 12:29:06.581 +00:00] [DEBUG] [relay.go:466] ["receive binlog event with header"] [component="relay log"] [header="{\"Timestamp\":1616416137,\"EventType\":16,\"ServerID\":1,\"EventSize\":31,\"LogPos\":927,\"Flags\":0}"]
[2021/03/22 12:29:06.581 +00:00] [DEBUG] [relay.go:501] ["writing binlog event"] [component="relay log"] [header="{\"Timestamp\":1616416137,\"EventType\":16,\"ServerID\":1,\"EventSize\":31,\"LogPos\":927,\"Flags\":0}"]
[2021/03/22 12:29:06.582 +00:00] [DEBUG] [relay.go:466] ["receive binlog event with header"] [component="relay log"] [header="{\"Timestamp\":1616416137,\"EventType\":33,\"ServerID\":2,\"EventSize\":84,\"LogPos\":1011,\"Flags\":0}"]
[2021/03/22 12:29:06.582 +00:00] [DEBUG] [relay.go:501] ["writing binlog event"] [component="relay log"] [header="{\"Timestamp\":1616416137,\"EventType\":33,\"ServerID\":2,\"EventSize\":84,\"LogPos\":1011,\"Flags\":0}"]
~~we can't start parsing a WriteRowsEvent without seeing its previous TableMapEvent~~ @GMHDBJD
[2021/03/24 20:04:55.985 +08:00] [DEBUG] [reader.go:548] ["start parse relay log file"] [task=test] [unit="binlog replication"] [component="binlog reader"] [file=/tmp/dm_test/new_relay/worker1/relay_log/09bec856-ba95-11ea-850a-58f2b4af5188.000001/mysql-bin.000001] [offset=1008]
...
[2021/03/24 20:04:55.986 +08:00] [ERROR] [reader.go:557] ["parse relay log file"] [task=test] [unit="binlog replication"] [component="binlog reader"] [file=/tmp/dm_test/new_relay/worker1/relay_log/09bec856-ba95-11ea-850a-58f2b4af5188.000001/mysql-bin.000001] [offset=1008] [error="Header &replication.EventHeader{Timestamp:0x605b2ae7, EventType:0x1e, ServerID:0x1, EventSize:0x35, LogPos:0x4e8, Flags:0x0}, Data \"ԑ\\x00\\x00\\x00\\x00\\x01\\x00\\x02\\x00\\x02\\xff\\xfc\\x03\\x00\\x00\\x00\\fEddard Stark\", Err: table id 37332: invalid table id, no corresponding table map event"] [errorVerbose="Header &replication.EventHeader{Timestamp:0x605b2ae7, EventType:0x1e, ServerID:0x1, EventSize:0x35, LogPos:0x4e8, Flags:0x0}, Data \"ԑ\\x00\\x00\\x00\\x00\\x01\\x00\\x02\\x00\\x02\\xff\\xfc\\x03\\x00\\x00\\x00\\fEddard Stark\", Err: table id 37332: invalid table id, no corresponding table map event\ngithub.com/pingcap/errors.AddStack\n\t/Users/lance/Projects/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/Users/lance/Projects/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/siddontang/go-mysql/replication.(*BinlogParser).parseSingleEvent\n\t/Users/lance/Projects/go/pkg/mod/github.com/lance6716/[email protected]/replication/parser.go:149\ngithub.com/siddontang/go-mysql/replication.(*BinlogParser).ParseReader\n\t/Users/lance/Projects/go/pkg/mod/github.com/lance6716/[email protected]/replication/parser.go:166\ngithub.com/siddontang/go-mysql/replication.(*BinlogParser).ParseFile\n\t/Users/lance/Projects/go/pkg/mod/github.com/lance6716/[email protected]/replication/parser.go:96\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseFile\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:552\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseFileAsPossible\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:414\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseDirAsPossible\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:381\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseRelay\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:310\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).StartSyncByGTID.func1\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:283\nruntime.goexit\n\t/usr/local/Cellar/go/1.13.6/libexec/src/runtime/asm_amd64.s:1357"]
~~will fix tomorrow~~
Seems we parse from WriteRowsEvent first when reparse. I think 1.0 has the same problem.
[2021/03/24 20:04:55.986 +08:00] [ERROR] [reader.go:557] ["parse relay log file"] [task=test] [unit="binlog replication"] [component="binlog reader"] [file=/tmp/dm_test/new_relay/worker1/relay_log/09bec856-ba95-11ea-850a-58f2b4af5188.000001/mysql-bin.000001] [offset=1008] [error="Header &replication.EventHeader{Timestamp:0x605b2ae7, EventType:0x1e, ServerID:0x1, EventSize:0x35, LogPos:0x4e8, Flags:0x0}, Data "ԑ\x00\x00\x00\x00\x01\x00\x02\x00\x02\xff\xfc\x03\x00\x00\x00\fEddard Stark", Err: table id 37332: invalid table id, no corresponding table map event"] [errorVerbose="Header &replication.EventHeader{Timestamp:0x605b2ae7, EventType:0x1e, ServerID:0x1, EventSize:0x35, LogPos:0x4e8, Flags:0x0}, Data "ԑ\x00\x00\x00\x00\x01\x00\x02\x00\x02\xff\xfc\x03\x00\x00\x00\fEddard Stark", Err: table id 37332: invalid table id, no corresponding table map event\ngithub.com/pingcap/errors.AddStack\n\t/Users/lance/Projects/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/Users/lance/Projects/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/siddontang/go-mysql/replication.(*BinlogParser).parseSingleEvent\n\t/Users/lance/Projects/go/pkg/mod/github.com/lance6716/[email protected]/replication/parser.go:149\ngithub.com/siddontang/go-mysql/replication.(*BinlogParser).ParseReader\n\t/Users/lance/Projects/go/pkg/mod/github.com/lance6716/[email protected]/replication/parser.go:166\ngithub.com/siddontang/go-mysql/replication.(*BinlogParser).ParseFile\n\t/Users/lance/Projects/go/pkg/mod/github.com/lance6716/[email protected]/replication/parser.go:96\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseFile\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:552\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseFileAsPossible\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:414\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseDirAsPossible\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:381\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).parseRelay\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:310\ngithub.com/pingcap/dm/pkg/streamer.(*BinlogReader).StartSyncByGTID.func1\n\t/Users/lance/Projects/dm/pkg/streamer/reader.go:283\nruntime.goexit\n\t/usr/local/Cellar/go/1.13.6/libexec/src/runtime/asm_amd64.s:1357"]
Seems we parse from WriteRowsEvent first when reparse. I think 1.0 has the same problem.
Now I think the r.parser is stateful, thus could know it has seen TableMapEvent last time it parse. I'll investigate tomorrow
I remember that parser will parse TableMapEvent and WriteEvent together, and may not stateful. good night. 🤣
I remember that parser will parse TableMapEvent and WriteEvent together, and may not stateful. good night. 🤣
We continue using same r.parser, so RowsEvent which is read after a resuming still has access to r.parser's known table ID.
there's some error filter about errMissingTableMapEvent. Will check if this can support DM-worker restarting on relay.meta after release
https://github.com/siddontang/go-mysql/blob/0c5789dd0bd378b4b84f99b320a2d35a80d8858f/replication/parser.go#L168