[FIX]: router constructor config init, add gw-plugin metrics server, simplify & enable vtc e2e test
Pull Request Description
Discovered critical bug while writing benchmark.
- [x] bug fix in
vtc-basicrouter constructor - [x] fix & simplify e2e test
- [x] expose metrics in gateway-plugins correctly
- [x] sample dashboard for gateway plugin vtc-basic routing algo
benchmark will be in follow up PR as this one is already large.
@varungup90 There was a bug in constructor init of vtc-basic router that was not caught bcos of options based init in tests. I found this while writing benchmarks.
reproduced and fixed it (https://github.com/vllm-project/aibrix/pull/1112/commits/851b7661dbcf72bf399dd6d3c421ed953b0ad13c, https://github.com/vllm-project/aibrix/pull/1112/commits/bd441b09ac6ff23cf745172045b88fb4a186c0fe, https://github.com/vllm-project/aibrix/pull/1112/commits/50367e3f5d8b1c0126ea841d745575f7379b40ed)
could you pls review this PR now ? ~~benchmark will follow in separate PR~~
I will also help take a look this weekend
@Jeffwan @varungup90 I've updated this PR to include benchmark notebook as well -
[x] custom benchmark results in notebook similar to this
When you have sometime pls review.
update: rebased with main
apologies for late response. I was busy with some events recently and I will finish review by EOD.
/cc @gemini-code-assist
Hi @Jeffwan, thanks for the update!
As you dive into the review, here's a quick overview of the key changes in this PR to help you focus:
- VTC Router Bug Fix: The PR includes a fix for a bug in the
vtc-basicrouter constructor (pkg/plugins/gateway/algorithms/vtc/token_tracker.go) that was discovered during benchmark writing. There's also a new test case (TestTokenTrackerWindowSizeThroughConstructor) to cover this. - Gateway Plugin Metrics: New Prometheus metrics are exposed for the gateway plugins, along with a new Grafana dashboard (
observability/grafana/AIBrix_Envoy_Gateway_Plugins_Dashboard.json) for visualization. This involves changes incmd/plugins/main.go,config/gateway/gateway-plugin/gateway-plugin.yaml, and the introduction ofpkg/metrics/server.go. - Improved E2E Tests: The
vtc_routing_test.gohas been significantly updated. It's now enabled by default (removedflakybuild tags), includes more robust tests for token-based routing and fairness under high utilization, and uses a more structuredTestMainfor Redis client management. - Development Workflow: New
Makefiletargets (dev-install-in-kind,dev-port-forward, etc.) and ascripts/port-forward.shscript have been added to streamline local development and testing with Kind.
Given the nature of these changes, you might want to pay particular attention to:
- The correctness of the VTC router fix and the new test's coverage.
- The Prometheus metric exposure and the Grafana dashboard for clarity and usefulness.
- The new e2e test cases in
vtc_routing_test.goto ensure they accurately reflect the desired routing behavior and cover relevant scenarios.
Let me know if you have any questions as you go through it!
overall looks good to me. @Venkat2811 BTW, some of the features in this PR seem generic enough to be extracted as standalone components. Personally, I think it would be cleaner to split this into three separate PRs. If you're okay with that, could you help break it down? Of course, if you feel there's too much overlap between parts, we can still proceed with reviewing and merging it as a single PR.
Here’s a possible breakdown:
- Gateway plugin extension: metrics server support + dashboard
- Tooling updates: port-forward support, Makefile changes
- VTC-related changes
Thanks for review @Jeffwan. Sure, will change this to draft and split into separate PRs and also address comments
split into separate PRs as agreed:
- https://github.com/vllm-project/aibrix/pull/1210
- https://github.com/vllm-project/aibrix/pull/1211
ready for review @Jeffwan
@Venkat2811 seems this one have some conflicts with main branch. could you rebase the main and resolve it. The change looks good to me
@Jeffwan Done 👍🏽
@Jeffwan is there anything else pending in this PR ? Looking to get this merged.
@Venkat2811 I tried to merge this morning but notice it has conflict with upstream. could you help resolve it? (Looks like my earlier comment didn’t go through)
Got it @Jeffwan, updated now. I had to Merge branch 'main' into vtc-router-e2e-tests bcos of too many commits and rebase conflicts. If squash and merge still shows rebase issues:
- I'm okay with you doing a merge commit
- If you don't want to do that, I'll create new PR https://github.com/vllm-project/aibrix/pull/1222 & we can close this
@Venkat2811 I see you already cut https://github.com/vllm-project/aibrix/pull/1222 , we can move to #1222. Once the integrate test pass, I will merge it and we can close this one
changes are merged in https://github.com/vllm-project/aibrix/pull/1222