VReplication: Improve handling of vplayer stalls
Description
Please see the issue for details about the problems we're attempting to solve/improve in this PR.
The improvements here are about detecting when we're not making any progress and showing/logging meaningful errors to replace the eventual generic EOF errors.
There are two types of stalls that we now detect and error on:
- Being unable to record heartbeats in the stream
- This is a simple check and always enabled
- Being unable to complete/commit a transaction consisting of replicated user events, which includes updating the vreplication record to record the latest position upon commit
- This is disabled by default: the new flag that controls the timeout defaults to 0s. This is a more complicated check, and given that the scenario we're improving is, to the best of my knowledge, very rare, I did not want to risk introducing new edge cases by default (especially since the heartbeat change should catch many of them).
Related Issue(s)
- Fixes: https://github.com/vitessio/vitess/issues/15974
Checklist
- [x] "Backport to:" labels have been added if this change should be back-ported to release branches
- [x] If this change is to be back-ported to previous releases, a justification is included in the PR description
- [x] Tests were added or are not required
- [x] Did the new or modified tests pass consistently locally and on CI?
- [x] Documentation was added or is not required
Review Checklist
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
General
- [ ] Ensure that the Pull Request has a descriptive title.
- [ ] Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.
Tests
- [ ] Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.
Documentation
- [ ] Apply the `release notes (needs details)` label if users need to know about this change.
- [ ] New features should be documented.
- [ ] There should be some code comments as to why things are implemented the way they are.
- [ ] There should be a comment at the top of each new or modified test to explain what the test does.
New flags
- [ ] Is this flag really necessary?
- [ ] Flag names must be clear and intuitive, use dashes (`-`), and have a clear help text.
If a workflow is added or modified:
- [ ] Each item in `Jobs` should be named in order to mark it as `required`.
- [ ] If the workflow needs to be marked as `required`, the maintainer team must be notified.
Backward compatibility
- [ ] Protobuf changes should be wire-compatible.
- [ ] Changes to `_vt` tables and RPCs need to be backward compatible.
- [ ] RPC changes should be compatible with vitess-operator.
- [ ] If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
- [ ] `vtctl` command output order should be stable and `awk`-able.
Codecov Report
Attention: Patch coverage is 85.29412% with 5 lines in your changes missing coverage. Please review.
Please upload report for BASE (`main@a16a030`). Learn more about missing BASE report.
Files | Patch % | Lines |
---|---|---|
.../vt/vttablet/tabletmanager/vreplication/vplayer.go | 61.53% | 5 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #15797 +/- ##
=======================================
Coverage ? 68.56%
=======================================
Files ? 1544
Lines ? 197779
Branches ? 0
=======================================
Hits ? 135608
Misses ? 62171
Partials ? 0
:umbrella: View full report in Codecov by Sentry.
With the stall detection enabled, I was a bit concerned that we were creating many goroutines, since we are creating one for every transaction. So I added some logging (https://github.com/vitessio/vitess/commit/98cb46242f1d719cabe191a3c842ecb38fbb9cd9) and ran the TestFKExt e2e test, which generates some load. Here are the last log lines on some of the instances. So we create a lot of goroutines, and some of them don't seem to be getting cleaned up.
vttablet.RSs-Laptop.rohit.log.INFO.20240524-135957.9653:I0524 14:00:24.159539 9653 vplayer.go:895] StallHandler-3902440483645348625: After stopping the stall handler goroutine for 15s, total: 900, active: 389 -- totalStallHandlerCount: 36
vttablet.RSs-Laptop.rohit.log.INFO.20240524-135957.9653:I0524 14:00:24.154294 9653 vplayer.go:895] StallHandler-4945214909004838383: After stopping the stall handler goroutine for 15s, total: 700, active: 208 -- totalStallHandlerCount: 36
vttablet.RSs-Laptop.rohit.log.INFO.20240524-135957.9653:I0524 14:00:20.501398 9653 vplayer.go:895] StallHandler-6799487594559643729: After stopping the stall handler goroutine for 15s, total: 11, active: 9 -- totalStallHandlerCount: 33
vttablet.RSs-Laptop.rohit.log.INFO.20240524-135957.9653:I0524 14:00:20.332871 9653 vplayer.go:895] StallHandler-2372108009013492043: After stopping the stall handler goroutine for 15s, total: 128, active: 123 -- totalStallHandlerCount: 30
I thought maybe the log was not accurate and the goroutines were getting cleaned up elsewhere, but after enabling pprof I see leaked goroutines at the end of the test on all the primaries.
http://localhost:16800/debug/pprof/goroutine?debug=1
goroutine profile: total 1553
1408 @ 0x1007f3118 0x1008065a4 0x101780d5c 0x10082d854
# 0x101780d5b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*stallHandler).startTimer.func1+0x25b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:898
http://localhost:16200/debug/pprof/goroutine?debug=1
goroutine profile: total 8604
8446 @ 0x104c8f118 0x104ca25a4 0x105c1cd5c 0x104cc9854
# 0x105c1cd5b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*stallHandler).startTimer.func1+0x25b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:898
http://localhost:16100/debug/pprof/goroutine?debug=1
goroutine profile: total 1787
1665 @ 0x10058b118 0x10059e5a4 0x101518d5c 0x1005c5854
# 0x101518d5b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*stallHandler).startTimer.func1+0x25b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:898
This is vplayer.go:898:
select {
case <-sh.timer.Load().C: // The timer expired
sh.fire <- vterrors.Wrapf(ErrVPlayerStalled,
"failed to commit transaction batch before the configured --vplayer-progress-deadline of %v", vplayerProgressDeadline)
case <-sh.stop: // The timer was stopped
log.Infof("%s: the stall handler timer was stopped", sh)
}
A couple of thoughts here:
- Can we perform the stall check around the query execute(s)? Or is there a good reason to have this separate handler-based mechanism?
- Instead of a tablet level flag, what do you think about adding a workflow option (in the new json object). That way we can allow it to be updated using WorkflowUpdate and it can be dynamically set/reset if required when the vplayer restarts and loads the settings?
With the stall detection enabled, I was a bit concerned that we were creating many goroutines, since we are creating one for every transaction. So I added some logging (98cb462) and ran the TestFKExt e2e test, which generates some load. Here are the last log lines on some of the instances. So we create a lot of goroutines, and some of them don't seem to be getting cleaned up.
@rohit-nayak-ps This is great, thank you! ❤️ This made the problem obvious to me:
diff --git a/go/vt/vttablet/tabletmanager/vreplication/vplayer.go b/go/vt/vttablet/tabletmanager/vreplication/vplayer.go
index df59f2203e..f067ad0b93 100644
--- a/go/vt/vttablet/tabletmanager/vreplication/vplayer.go
+++ b/go/vt/vttablet/tabletmanager/vreplication/vplayer.go
@@ -866,12 +866,9 @@ func (sh *stallHandler) startTimer() error {
if swapped := sh.timer.CompareAndSwap(nil, time.NewTimer(sh.deadline)); !swapped {
// Otherwise, reset the timer.
if sh.timer.Load().Reset(sh.deadline) {
- // The timer was already running, so be sure the channel is drained.
- select {
- case <-sh.timer.Load().C:
- default:
- }
-
+ // The timer goroutine was already running, so now that we've reset the timer
+ // it's using, we're done.
+ return nil
}
}
go func() {
I confirmed the results using your 98cb46242f1d719cabe191a3c842ecb38fbb9cd9 patch:
❯ go test -count 1 -v -failfast -timeout 15m -run TestFKExt vitess.io/vitess/go/test/endtoend/vreplication
=== RUN TestFKExt
E0524 13:51:02.383109 29560 vtorc_process.go:103] configuration - {
"Debug": true,
"ListenAddress": ":0",
"MySQLTopologyUser": "orc_client_user",
"MySQLTopologyPassword": "orc_client_user_password",
"MySQLReplicaUser": "vt_repl",
"MySQLReplicaPassword": "",
"RecoveryPeriodBlockSeconds": 1
}
=== RUN TestFKExt/Import_from_external_db
=== RUN TestFKExt/Import_from_external_db/vtctldclient_vdiff_target1.import
=== RUN TestFKExt/Import_from_external_db/vtctldclient_vdiff_source.import_reverse
=== RUN TestFKExt/Import_from_external_db/vtctldclient_vdiff_target1.import#01
=== RUN TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace
=== RUN TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace/vtctldclient_vdiff_target2.move
=== RUN TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace/vtctldclient_vdiff_target1.move_reverse
=== RUN TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace/vtctldclient_vdiff_target2.move#01
=== RUN TestFKExt/Materialize_parent_and_copy_tables_without_constraints
=== RUN TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards
=== RUN TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards/vtctldclient_vdiff_target2.reshard2to3
=== RUN TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards/vtctlclient_vdiff_target2.reshard2to3_reverse
=== RUN TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards/vtctldclient_vdiff_target2.reshard2to3#01
=== RUN TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard
=== RUN TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard/vtctldclient_vdiff_target2.reshard3to1
=== RUN TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard/vtctlclient_vdiff_target2.reshard3to1_reverse
=== RUN TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard/vtctldclient_vdiff_target2.reshard3to1#01
=== RUN TestFKExt/Validate_materialize_counts_at_end_of_test
--- PASS: TestFKExt (267.69s)
--- PASS: TestFKExt/Import_from_external_db (47.85s)
--- PASS: TestFKExt/Import_from_external_db/vtctldclient_vdiff_target1.import (5.15s)
--- PASS: TestFKExt/Import_from_external_db/vtctldclient_vdiff_source.import_reverse (5.13s)
--- PASS: TestFKExt/Import_from_external_db/vtctldclient_vdiff_target1.import#01 (5.24s)
--- PASS: TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace (52.69s)
--- PASS: TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace/vtctldclient_vdiff_target2.move (5.17s)
--- PASS: TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace/vtctldclient_vdiff_target1.move_reverse (5.11s)
--- PASS: TestFKExt/MoveTables_from_unsharded_to_sharded_keyspace/vtctldclient_vdiff_target2.move#01 (5.31s)
--- PASS: TestFKExt/Materialize_parent_and_copy_tables_without_constraints (1.62s)
--- PASS: TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards (85.44s)
--- PASS: TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards/vtctldclient_vdiff_target2.reshard2to3 (8.45s)
--- PASS: TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards/vtctlclient_vdiff_target2.reshard2to3_reverse (6.24s)
--- PASS: TestFKExt/Reshard_keyspace_from_2_shards_to_3_shards/vtctldclient_vdiff_target2.reshard2to3#01 (15.68s)
--- PASS: TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard (61.60s)
--- PASS: TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard/vtctldclient_vdiff_target2.reshard3to1 (5.19s)
--- PASS: TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard/vtctlclient_vdiff_target2.reshard3to1_reverse (7.32s)
--- PASS: TestFKExt/Reshard_keyspace_from_3_shards_to_1_shard/vtctldclient_vdiff_target2.reshard3to1#01 (11.72s)
--- PASS: TestFKExt/Validate_materialize_counts_at_end_of_test (0.01s)
PASS
ok vitess.io/vitess/go/test/endtoend/vreplication 268.785s
❯ grep "stall handler" /opt/vtdataroot/tmp/*
...
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240524-135424.38709:I0524 13:55:17.701397 38709 vplayer.go:892] StallHandler-2511605806418698265: After stopping the stall handler goroutine for 15s, total: 110, active: 0 -- totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240524-135424.38709:I0524 13:55:17.703183 38709 vplayer.go:878] StallHandler-2511605806418698265: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240524-135424.38709:I0524 13:55:17.703197 38709 vplayer.go:888] StallHandler-2511605806418698265: Starting the stall handler goroutine for 15s, total: 111, active: 1, totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240524-135424.38709:I0524 13:55:17.703535 38709 vplayer.go:912] StallHandler-2511605806418698265: Stopping the stall handler timer
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240524-135424.38709:I0524 13:55:17.703544 38709 vplayer.go:900] StallHandler-2511605806418698265: the stall handler timer was stopped
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240524-135424.38709:I0524 13:55:17.703546 38709 vplayer.go:892] StallHandler-2511605806418698265: After stopping the stall handler goroutine for 15s, total: 111, active: 0 -- totalStallHandlerCount: 36
Thank you again! I'll have to remember that technique you used there — super handy.
A couple of thoughts here:
- Can we perform the stall check around the query execute(s)? Or is there a good reason to have this separate handler-based mechanism?
@rohit-nayak-ps In the case seen, it was not a single query that was timing out or otherwise problematic. The problem was that we could not commit the batch transaction, which contained many slow queries. So it's a combination of a high write rate on the source with slow queries on the target. We end up with a big transaction batch in the vplayer that we're unable to commit as things back up: e.g. we're concurrently trying to update the `_vt.vreplication` record to record the heartbeat too, but the main active transaction from the previous read of the relay log cannot complete before it tries to do that. The relay log I/O pauses and backs up as well, because we're unable to drain its contents to execute as the next transaction batch while we're stuck trying to execute the previous one.
- Instead of a tablet level flag, what do you think about adding a workflow option (in the new json object). That way we can allow it to be updated using WorkflowUpdate and it can be dynamically set/reset if required when the vplayer restarts and loads the settings?
That's a good idea! With the new json field that will be much simpler and I like that better.
- Instead of a tablet level flag, what do you think about adding a workflow option (in the new json object). That way we can allow it to be updated using WorkflowUpdate and it can be dynamically set/reset if required when the vplayer restarts and loads the settings?
That's a good idea! With the new json field that will be much simpler and I like that better.
@rohit-nayak-ps I started on that work here: https://github.com/vitessio/vitess/pull/15797/commits/7b5c4c51496b4fe55a369b3e469dfe4b1995226f
There will still be more work to do, so I will let you know when it's ready for another review. Until then, you can see where I'm going if you like and let me know if you perhaps changed your mind (it's certainly much more involved this way, but I do think it's better overall).
- Instead of a tablet level flag, what do you think about adding a workflow option (in the new json object). That way we can allow it to be updated using WorkflowUpdate and it can be dynamically set/reset if required when the vplayer restarts and loads the settings?
That's a good idea! With the new json field that will be much simpler and I like that better.
@rohit-nayak-ps after spending quite a bit of time on it, I ended up reverting the work: https://github.com/vitessio/vitess/pull/15797/commits/b060e09c82229ddf72ad015e4b905977b628427c
There was still more work to do there, and in working on it, it became clearer to me that this really is a vplayer component setting rather than a workflow option. Passing a duration all the way through from client flags, for each type of stream, just to configure the vplayer on the tablet, seemed like a lot of work today and into the future — any command that uses a vplayer that we want to be able to configure this way would need to have it too; e.g. OnlineDDL and ApplySchema also need them today, as the original issue seen was with OnlineDDL. As such, I think it makes sense for it to be a vttablet flag. If you feel strongly about it, however, I can pick that work back up. It was nearly to the point where you could set it, view it, and update it in the various client commands. From there we would just need to read and unmarshal the value from the _vt.vreplication record in the vreplication engine and thread it through the controller, vreplicator, and vplayer.
@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check. I do believe that will still be able to catch all of the cases, including the one seen in the original production issue. In that case the position was not getting updated via the replicated transaction or the heartbeat recording, so we should still have gotten the stalled vttablet log and the issue noted in the workflow's message field.
In any event, I did another test with the latest state:
❯ grep "stall handler" /opt/vtdataroot/tmp/*
...
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.532924 12895 vplayer.go:897] StallHandler-2527617803385045679: After stopping the stall handler goroutine for 15s, total: 422, active: 0 -- totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537210 12895 vplayer.go:878] StallHandler-2527617803385045679: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537223 12895 vplayer.go:893] StallHandler-2527617803385045679: Starting the stall handler goroutine for 15s, total: 423, active: 1, totalStallHandlerCount: 36
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537449 12895 vplayer.go:878] StallHandler-2527617803385045679: Resetting the stall handler timer to 15s
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537781 12895 vplayer.go:917] StallHandler-2527617803385045679: Stopping the stall handler timer
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537791 12895 vplayer.go:905] StallHandler-2527617803385045679: the stall handler timer was stopped
/opt/vtdataroot/tmp/vttablet.pslord.matt.log.INFO.20240525-023958.12895:I0525 02:40:27.537793 12895 vplayer.go:897] StallHandler-2527617803385045679: After stopping the stall handler goroutine for 15s, total: 423, active: 0 -- totalStallHandlerCount: 36
@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check.
This might indeed be a good path to choose, since your other change is catching the known issues.
Also while goroutines are lightweight, generating goroutines for every transaction could also lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues etc especially since we will be enabling it in a high-qps situation.
@rohit-nayak-ps I could also just remove the transaction duration (stallHandler) part and stick with the simpler heartbeat based check.
This might indeed be a good path to choose, since your other change is catching the known issues.
Also while goroutines are lightweight, generating goroutines for every transaction could also lead to unpredictable performance impacts due to garbage collection, memory usage, scheduling issues etc especially since we will be enabling it in a high-qps situation.
@rohit-nayak-ps Having more than 1 active/concurrent goroutine per `vplayer`/`stallHandler` was a bug. The intention was for this to be a lock-free, idempotent stall handler that was very difficult to mess up — precisely because it's a high-QPS component with a somewhat awkward way of managing transactions, so you may do multiple BEGINs before a COMMIT, e.g. There should only ever be 0 or 1 goroutines active/running per `vplayer`/`stallHandler` instance. After spending so much time on the investigation/testing/debugging part of the work, I was a bit too eager to mark this as ready for review once my manual test and the new unit test were passing. That was sloppy on my part — I missed the bug in my self review and I clearly didn't have adequate testing; I apologize. Your thorough review was much appreciated though, as you caught it. ❤️
I've added comments and a new unit test that demonstrate it now working as intended: https://github.com/vitessio/vitess/pull/15797/commits/3839e901b2349fdaed080595461cbb36c1ee497d
I wanted to get it working so that even if I do end up discarding it here we'll still have it in the PR history if we ever want to revive it or reuse it for something else.
With all that being said, let me know what you think. I'm fine removing it for now, since it was only well into my testing and debugging that I realized the existing heartbeat mechanism combined with the `vreplicationMinimumHeartbeatUpdateInterval` flag/variable was already supposed to effectively serve the role of detecting stalls. The problem was a bug in that handling where we cleared the pending heartbeat counter before we were successfully able to save/record the heartbeat (we assumed it would be immediately successful). So now that I'm well aware of it and have also confirmed that the mechanism is working as it should (it's used in the ROW based test case in the new `TestPlayerStalls` unit test), I think that the `stallHandler` mechanism is at least largely redundant. If I'm being honest, the main reason I would not want to delete it is that I spent some time on it. 🙂 That's obviously not a great reason to keep it, and now that it's working and saved somewhere (commit history here), I'm totally fine removing it if you and others prefer that. Thanks again!
I think that the stallHandler mechanism is at least largely redundant. If I'm being honest, the main reason I would not want to delete it is that I spent some time on it.
@mattlord now that it's been said, it cannot be unsaid... :) Looking at the PR changes, the stall handler takes up a considerable part of the actual logic changes. So I'd be curious to see a solution where the stall handler does not exist.
@rohit-nayak-ps and @shlomi-noach this removes all of the `stallHandler` related code: https://github.com/vitessio/vitess/pull/15797/commits/c666b1442b216413bd36f31926489cc0581af511
What we lose without it is the ability to perform out-of-band monitoring and error reporting. Meaning that the heartbeat method will only detect a stall when it was due to a failure to commit the transaction that updates the timestamp for the workflow (whether that was done on its own or as part of replicating user generated events).
If you both prefer that then I'll update the PR description accordingly. Thanks!
@mattlord thank you, I think we should go with this change, seeing that it's so succinct.
Same here. Thanks @mattlord.
Good news: I can repeat the exact scenario and symptoms seen in production with the "Test case which demonstrates a more typical scenario that would be seen in production" here: https://github.com/vitessio/vitess/issues/15974
Bad news: neither of the two methods added here is generating the helpful error message. I'm still getting:
{
"id": "21",
"stream_id": "1",
"type": "Message",
"state": "Running",
"created_at": {
"seconds": "1716998552",
"nanoseconds": 0
},
"updated_at": {
"seconds": "1716998552",
"nanoseconds": 0
},
"message": "vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 1cb76134-1dc9-11ef-82c4-5f4020e10986:1-546395: unexpected EOF\nio.ReadFull(packet body of length 78) failed (errno 2013) (sqlstate HY000)",
"count": "1"
}
So I need to keep digging...
@rohit-nayak-ps and @shlomi-noach OK, now that I'm able to repeat the exact issue seen in production (which kicked off this work) — I was missing a detail in that test case which I only realized this week — I know that I've now fixed it with the latest commit.
Before that commit, even with the `stallHandler` related code in place, I was still only getting the eventual not-terribly-helpful error that had been seen in production:
# from the 'vtctldclient workflow show' output
"message": "vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 1cb76134-1dc9-11ef-82c4-5f4020e10986:1-537840: unexpected EOF\nio.ReadFull(packet body of length 2658) failed (errno 2013) (sqlstate HY000)",
❯ grep -Ri stalled ${VTDATAROOT}/tmp/vttablet*ERROR*
It's quite some time before that error surfaces, and in the state leading up to it you cannot even `stop` or `show` the workflow as things have backed up so badly.
With the latest commit, however (`vplayer-progress-deadline=5m`, and we are in fact not making any progress; the position in the workflow is not moving forward):
# from the 'vtctldclient workflow show' output
"message": "relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long",
❯ grep -Ri stalled ${VTDATAROOT}/tmp/vttablet*ERROR*
E0530 19:24:37.570804 71715 dbclient.go:128] error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long
E0530 19:31:04.982780 71715 dbclient.go:128] error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long
E0530 19:37:34.468556 71715 dbclient.go:128] error in stream 3, will retry after 5s: relay log I/O stalled: progress stalled; vplayer was unable to replicate the transaction in a timely manner; examine the target mysqld instance health and the replicated queries' EXPLAIN output to see why queries are taking unusually long
...
And you can `start`/`stop`/`show` the workflow during this time, as we're not letting things back up so badly.
What the PR now does is detect stalls on both "ends" in the vplayer:
- It will detect relay log I/O stalls — where we cannot successfully replicate the previous relay log contents (continually retrying), so the relay log fills up with the next batch and we cannot write to it anymore as we hit the max relay log size/items. This is a stall on the input side of the vplayer.
- It will detect heartbeat recording failures — where we cannot record the heartbeat in the workflow record within the minimum heartbeat interval. This is a stall on the output side of the vplayer.
I will now update the PR description as well. I will also add the `Do Not Merge` label on the PR since we should not merge this until we've cut the v20 RC and main becomes v21. Given that, there's no rush on the reviews, but I'd appreciate them whenever you can. 🙏
It says this in the description:
It's quite some time before that error surfaces when in this stalled state, and in the state leading up to this error you cannot even stop or show the workflow as things have backed up so badly.
What causes `show` to hang?
It says this in the description:
It's quite some time before that error surfaces when in this stalled state, and in the state leading up to this error you cannot even stop or show the workflow as things have backed up so badly.
What causes `show` to hang?
@rohit-nayak-ps I'm sorry, I was wrong on the `show` part. I've confirmed that in new tests, and now that I think about it, I WAS able to run `show` during the production incident as well. It was only `stop` that hung (I'll update the description accordingly). (I did see `show` hang in earlier tests, but that was due to things being hung in a different way from various tests.)
For the `stop` in my tests...
1 @ 0x10032f118 0x1002f8b64 0x1002f8724 0x10128d288 0x101291f78 0x1014f0bc0 0x1014f0ba5 0x10192670c 0x101906f30 0x1008c9cc8 0x1008cdd40 0x1008c808c 0x100369854
# 0x10128d287 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*controller).Stop+0x47 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go:325
# 0x101291f77 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*Engine).exec+0xcd7 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/engine.go:452
# 0x1014f0bbf vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*Engine).Exec+0xb6f vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/engine.go:348
# 0x1014f0ba4 vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).UpdateVReplicationWorkflow+0xb54 vitess.io/vitess/go/vt/vttablet/tabletmanager/rpc_vreplication.go:472
# 0x10192670b vitess.io/vitess/go/vt/vttablet/grpctmserver.(*server).UpdateVReplicationWorkflow+0x11b vitess.io/vitess/go/vt/vttablet/grpctmserver/server.go:423
# 0x101906f2f vitess.io/vitess/go/vt/proto/tabletmanagerservice._TabletManager_UpdateVReplicationWorkflow_Handler+0x1bf vitess.io/vitess/go/vt/proto/tabletmanagerservice/tabletmanagerservice_grpc.pb.go:1634
# 0x1008c9cc7 google.golang.org/grpc.(*Server).processUnaryRPC+0xb57 google.golang.org/[email protected]/server.go:1369
# 0x1008cdd3f google.golang.org/grpc.(*Server).handleStream+0xb1f google.golang.org/[email protected]/server.go:1780
# 0x1008c808b google.golang.org/grpc.(*Server).serveStreams.func2.1+0x8b google.golang.org/[email protected]/server.go:1019
That's here: https://github.com/vitessio/vitess/blob/8c978574a3e5e00fbfeb30ad86437f72ea7d5276/go/vt/vttablet/tabletmanager/vreplication/controller.go#L322-L326
Which is waiting for `run()` to end, as that `done` channel is closed in a defer there.
That goroutine in turn is blocked on this one:
1 @ 0x10032f118 0x1002f8b64 0x1002f8724 0x1012ba348 0x1012ba254 0x1012b9a1c 0x1012bd548 0x1012bcb58 0x10128c28c 0x10128b9a8 0x100369854
# 0x1012ba347 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func4+0x37 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:304
# 0x1012ba253 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply+0x573 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:309
# 0x1012b9a1b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).play+0x31b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:210
# 0x1012bd547 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vreplicator).replicate+0x5d7 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vreplicator.go:311
# 0x1012bcb57 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vreplicator).Replicate+0x27 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vreplicator.go:178
# 0x10128c28b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*controller).runBlp+0x52b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go:257
# 0x10128b9a7 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*controller).run+0x67 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go:153
Which is here at line 293 (line numbers are a little off as this was done in my PR branch): https://github.com/vitessio/vitess/blob/8c978574a3e5e00fbfeb30ad86437f72ea7d5276/go/vt/vttablet/tabletmanager/vreplication/vplayer.go#L289-L294
And the goroutine for that function:
1 @ 0x10032f118 0x100328948 0x100362b70 0x10040efc8 0x100410310 0x100410301 0x1004b0d98 0x1004c0a14 0x100542c14 0x1003a87c0 0x100dd6ff8 0x100dd6fd1 0x100dd7230 0x100dec9b8 0x100deb964 0x100deb588 0x100deb008 0x101232080 0x1012b8684 0x1012b8c74 0x1012b8c61 0x1012b96c4 0x1012baa28 0x10129bf5c 0x10129aa14 0x1012ba8f8 0x1012bc19c 0x1012bb5ac 0x1012ba394 0x100369854
# 0x100362b6f internal/poll.runtime_pollWait+0x9f runtime/netpoll.go:345
# 0x10040efc7 internal/poll.(*pollDesc).wait+0x27 internal/poll/fd_poll_runtime.go:84
# 0x10041030f internal/poll.(*pollDesc).waitRead+0x1ff internal/poll/fd_poll_runtime.go:89
# 0x100410300 internal/poll.(*FD).Read+0x1f0 internal/poll/fd_unix.go:164
# 0x1004b0d97 net.(*netFD).Read+0x27 net/fd_posix.go:55
# 0x1004c0a13 net.(*conn).Read+0x33 net/net.go:179
# 0x100542c13 bufio.(*Reader).Read+0x1b3 bufio/bufio.go:241
# 0x1003a87bf io.ReadAtLeast+0x9f io/io.go:335
# 0x100dd6ff7 io.ReadFull+0x47 io/io.go:354
# 0x100dd6fd0 vitess.io/vitess/go/mysql.(*Conn).readHeaderFrom+0x20 vitess.io/vitess/go/mysql/conn.go:389
# 0x100dd722f vitess.io/vitess/go/mysql.(*Conn).readEphemeralPacket+0x6f vitess.io/vitess/go/mysql/conn.go:427
# 0x100dec9b7 vitess.io/vitess/go/mysql.(*Conn).readComQueryResponse+0x37 vitess.io/vitess/go/mysql/query.go:544
# 0x100deb963 vitess.io/vitess/go/mysql.(*Conn).ReadQueryResult+0x53 vitess.io/vitess/go/mysql/query.go:396
# 0x100deb587 vitess.io/vitess/go/mysql.(*Conn).ExecuteFetchMulti+0x97 vitess.io/vitess/go/mysql/query.go:364
# 0x100deb007 vitess.io/vitess/go/mysql.(*Conn).ExecuteFetch+0x27 vitess.io/vitess/go/mysql/query.go:318
# 0x10123207f vitess.io/vitess/go/vt/binlog/binlogplayer.(*dbClientImpl).ExecuteFetch+0x2f vitess.io/vitess/go/vt/binlog/binlogplayer/dbclient.go:144
# 0x1012b8683 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vdbClient).ExecuteFetch+0x1b3 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vdbclient.go:123
# 0x1012b8c73 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vdbClient).Execute+0x43 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vdbclient.go:166
# 0x1012b8c60 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vdbClient).ExecuteWithRetry+0x30 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vdbclient.go:170
# 0x1012b96c3 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.newVPlayer.func1+0x43 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:128
# 0x1012baa27 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyRowEvent.func1+0xd7 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:348
# 0x10129bf5b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.execParsedQuery+0x5b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/replicator_plan.go:561
# 0x10129aa13 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*TablePlan).applyChange+0x4a3 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/replicator_plan.go:412
# 0x1012ba8f7 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyRowEvent+0x1e7 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:373
# 0x1012bc19b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvent+0x82b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:665
# 0x1012bb5ab vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).applyEvents+0x64b vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:547
# 0x1012ba393 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication.(*vplayer).fetchAndApply.func2+0x33 vitess.io/vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go:281
Which is trying and retrying to commit the last relay log batch, cycling through queries like this:
| 460472 | vt_filtered | localhost | vt_commerce | Query | 1 | updating | update _vt_vrp_a505f42c1f5611ef9043920702940ee1_20240531100407_ set updated_at='2024-05-31 10:05:38', created_at='2024-05-29 11:54:37', colstuff='foobar' where id=(convert('126062630024235_id' using utf8mb4)) and id2='186062188424265_id2' |
Each of these queries does a full table scan, so each one takes a few seconds.
So it looks like we process the stop request and cancel the controller's context, which causes the vstreamer to generate an io.EOF stream error. The vplayer gets that error and then waits for the fetch-and-apply goroutine to end and close its error channel before run can complete and close its done channel, at which point the controller finally ends. I think that we're stuck down in the MySQL layer for some time (it DOES eventually stop), which does not check the context.
@mattlord please go ahead and merge this after removing the flag.