etcd icon indicating copy to clipboard operation
etcd copied to clipboard

cache: [WIP] implement MVP watch demux (Code Review Only)

Open apullo777 opened this issue 7 months ago • 4 comments

This is a WIP draft PR created to start the code review process. Feedback welcome!

What’s in this PR

  • implements the MVP Watch demultiplexer only, per earlier discussion with mentors
  • defers the storage & indexing layer to next stage, so we can focus on correctness and robustness of the core demux

What's next

  • continue writing unit tests for all demux/watch behaviors (catch-up, backoff, drop threshold, etc.)

Tagging mentors for code review: @serathius @MadhavJivrajani

apullo777 avatar Jun 11 '25 22:06 apullo777

Hi @apullo777. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 11 '25 22:06 k8s-ci-robot

Overall note on architecture. You have two separate synchronization methods, first one sends events available in ringBuffer upon the watch creation, and second broadcasts new events to synchronized watchers. If a synchronized watcher don't handle the event, the only option you have is to drop it (count missed events is not correct). With watcher buffer 128 and shared buffer 2048 if watch just requests old revision, the per watcher buffer will be filled and broadcast will immediately fail. Fix to that is to keep a list of unsychronized watchers and periodically attempt to synchronize them.

serathius avatar Jun 18 '25 10:06 serathius

This looks much better, good job!

Expect maybe 1-2 more iterations and this will be mergable.

serathius avatar Jun 19 '25 12:06 serathius

Note, please close comments that you have addressed.

serathius avatar Jun 23 '25 08:06 serathius

Just couple of last comments, and this code is in mergeable. Please follow https://github.com/etcd-io/etcd/pull/20160/checks?check_run_id=44807170862 instructions to fix the DCO and remove the draft.

serathius avatar Jun 26 '25 08:06 serathius

Please also remove the conflict

serathius avatar Jun 27 '25 10:06 serathius

/ok-to-test

serathius avatar Jun 27 '25 10:06 serathius

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 65.85%. Comparing base (4b98469) to head (2c311cc). Report is 2 commits behind head on main.

Additional details and impacted files

see 54 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20160      +/-   ##
==========================================
- Coverage   69.00%   65.85%   -3.16%     
==========================================
  Files         415      412       -3     
  Lines       34595    34460     -135     
==========================================
- Hits        23873    22693    -1180     
- Misses       9328    10379    +1051     
+ Partials     1394     1388       -6     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4b98469...2c311cc. Read the comment docs.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 29 '25 19:06 codecov[bot]

Please fix:

cache_test.go:89:28: context.Background() could be replaced by t.Context() in TestCacheWatcherSeesEntireKeyspace (usetesting)
	if err := cache.WaitReady(context.Background()); err != nil {

serathius avatar Jun 30 '25 10:06 serathius

/retest pull-etcd-release-tests

serathius avatar Jun 30 '25 10:06 serathius

@serathius: The /retest command does not accept any targets. The following commands are available to trigger required jobs:

/test pull-etcd-build
/test pull-etcd-contrib-mixin
/test pull-etcd-coverage-report
/test pull-etcd-e2e-386
/test pull-etcd-e2e-amd64
/test pull-etcd-e2e-arm64
/test pull-etcd-fuzzing-v3rpc
/test pull-etcd-govulncheck-main
/test pull-etcd-grpcproxy-e2e-amd64
/test pull-etcd-grpcproxy-e2e-arm64
/test pull-etcd-grpcproxy-integration-amd64
/test pull-etcd-grpcproxy-integration-arm64
/test pull-etcd-integration-1-cpu-amd64
/test pull-etcd-integration-1-cpu-arm64
/test pull-etcd-integration-2-cpu-amd64
/test pull-etcd-integration-2-cpu-arm64
/test pull-etcd-integration-4-cpu-amd64
/test pull-etcd-integration-4-cpu-arm64
/test pull-etcd-markdown-lint
/test pull-etcd-release-tests
/test pull-etcd-robustness-amd64
/test pull-etcd-robustness-arm64
/test pull-etcd-unit-test-386
/test pull-etcd-unit-test-amd64
/test pull-etcd-unit-test-arm64
/test pull-etcd-verify

Use /test all to run the following jobs that were automatically triggered:

pull-etcd-build
pull-etcd-contrib-mixin
pull-etcd-coverage-report
pull-etcd-e2e-386
pull-etcd-e2e-amd64
pull-etcd-e2e-arm64
pull-etcd-fuzzing-v3rpc
pull-etcd-govulncheck-main
pull-etcd-grpcproxy-e2e-amd64
pull-etcd-grpcproxy-e2e-arm64
pull-etcd-grpcproxy-integration-amd64
pull-etcd-grpcproxy-integration-arm64
pull-etcd-integration-1-cpu-amd64
pull-etcd-integration-1-cpu-arm64
pull-etcd-integration-2-cpu-amd64
pull-etcd-integration-2-cpu-arm64
pull-etcd-integration-4-cpu-amd64
pull-etcd-integration-4-cpu-arm64
pull-etcd-release-tests
pull-etcd-robustness-amd64
pull-etcd-robustness-arm64
pull-etcd-unit-test-386
pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64
pull-etcd-verify

In response to this:

/retest pull-etcd-release-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 30 '25 10:06 k8s-ci-robot

/retest pull-etcd-release-tests

serathius avatar Jun 30 '25 10:06 serathius

Missclick

serathius avatar Jun 30 '25 10:06 serathius

/retest

serathius avatar Jun 30 '25 13:06 serathius

pull-etcd-release-tests fails, looks like release script doesn't handle packages that were never released?

cc @ivanvc any guesses?

serathius avatar Jun 30 '25 13:06 serathius

go: updates to go.mod needed, disabled by -mod=readonly; to update it:
	go mod tidy

Please run go mod tidy in root and cache directories.

To confirm that it worked you can run make verify-gofmt

serathius avatar Jul 02 '25 12:07 serathius

go: updates to go.mod needed, disabled by -mod=readonly; to update it:
	go mod tidy

Please run go mod tidy in root and cache directories.

To confirm that it worked you can run make verify-gofmt

I’ve run go mod tidy in both the root and the cache directories, but looks like the “updates to go.mod needed” error still appear. make verify-gofmt also showed a PASSED and SUCCESS. Any idea of what is happening? Perhaps go.mod discrepancy?

apullo777 avatar Jul 02 '25 12:07 apullo777

I’ve run go mod tidy in both the root and the cache directories

You need to rebase on origin/main first. The CI doesn't test your exact commit, but a rebased on latest branch state.

serathius avatar Jul 02 '25 13:07 serathius

-release

@serathius, I had a hard time finding this message in this pull request; it has a lot of activity. I see that you suggested a change that will fix the release tests, but they are still failing. Before I investigate, is this still an issue? Please feel free to contact me on Slack.

ivanvc avatar Jul 03 '25 00:07 ivanvc

@serathius Sorry, I couldn't solve the rebase issue, and it now turns into a weird situation. I tried to do git rebase origin/main but it hit a strange issue: even with a clean working tree (both git status and git stash report no local edits), my git rebase aborts with merge conflicts with go.mod/sum files. The weird part is that Git seems to think those files are already up-to-date, so I cannot resolve the "invisible" conflict.

Stuck with this issue, I tried to force pushed a brand new branch onto the old PR branch name. That fixed the CI-side rebase error, but in doing so it also pulled in the root repo’s latest bump-Go commit (and its file changes) into my cache-mvp feature branch. So now those upstream modifications are being pushed as part of the commit, while they do not appear on my local, so I cannot remove them to update the PR commit.

At this point I can’t seem to remove them or get a clean history. Should I start over a new PR with a new branch?

apullo777 avatar Jul 03 '25 02:07 apullo777

-release

@serathius, I had a hard time finding this message in this pull request; it has a lot of activity. I see that you suggested a change that will fix the release tests, but they are still failing. Before I investigate, is this still an issue? Please feel free to contact me on Slack.

@ivanvc Thanks for taking a look! It appears the release test is now passing, so that part is resolved. But now my feature work are polluted I will try to figure out how to solve it. Sorry for the confusion...

apullo777 avatar Jul 03 '25 02:07 apullo777

Should I start over a new PR with a new branch?

No

serathius avatar Jul 03 '25 07:07 serathius

@serathius Sorry, I couldn't solve the rebase issue, and it now turns into a weird situation. I tried to do git rebase origin/main but it hit a strange issue: even with a clean working tree (both git status and git stash report no local edits), my git rebase aborts with merge conflicts with go.mod/sum files. The weird part is that Git seems to think those files are already up-to-date, so I cannot resolve the "invisible" conflict.

Yea, you broke git history. Somehow you changed the commit type to merge commit. I fixed it and pushed to your fork. Before making any more changes please fetch the branch locally.

serathius avatar Jul 03 '25 08:07 serathius

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: apullo777, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 03 '25 09:07 k8s-ci-robot

There are still a lot of things we can improve, but this looks like a good start. The most important thing next is fixing the Atomic watch property. The current implementation assumes that revision is a unique number, this is not true.

TXN can have multiple operations (Put/Delete) that are executed within one transactions, all this operations will share revision. Atomic watch guarantee ensures that events that share revisions will always be send in single watch response and not broken up. https://etcd.io/docs/v3.5/learning/api_guarantees/#watch-apis

Prepared PR to show what I mean https://github.com/etcd-io/etcd/pull/20272

serathius avatar Jul 03 '25 11:07 serathius