aibrix icon indicating copy to clipboard operation
aibrix copied to clipboard

[FIX]: router constructor config init, add gw-plugin metrics server, simplify & enable vtc e2e test

Open Venkat2811 opened this issue 7 months ago • 3 comments

Pull Request Description

Discovered critical bug while writing benchmark.

  • [x] bug fix in vtc-basic router constructor
  • [x] fix & simplify e2e test
  • [x] expose metrics in gateway-plugins correctly
  • [x] sample dashboard for gateway plugin vtc-basic routing algo

benchmark will be in follow up PR as this one is already large.

Venkat2811 avatar May 20 '25 14:05 Venkat2811

@Venkat2811 Integration test failed, can you please check the error.

image

varungup90 avatar May 20 '25 17:05 varungup90

@varungup90 There was a bug in constructor init of vtc-basic router that was not caught bcos of options based init in tests. I found this while writing benchmarks.

reproduced and fixed it (https://github.com/vllm-project/aibrix/pull/1112/commits/851b7661dbcf72bf399dd6d3c421ed953b0ad13c, https://github.com/vllm-project/aibrix/pull/1112/commits/bd441b09ac6ff23cf745172045b88fb4a186c0fe, https://github.com/vllm-project/aibrix/pull/1112/commits/50367e3f5d8b1c0126ea841d745575f7379b40ed)

could you pls review this PR now ? ~~benchmark will follow in separate PR~~

Venkat2811 avatar May 29 '25 23:05 Venkat2811

I will also help take a look this weekend

Jeffwan avatar May 30 '25 16:05 Jeffwan

@Jeffwan @varungup90 I've updated this PR to include benchmark notebook as well -

[x] custom benchmark results in notebook similar to this

  • I have grouped users based on token len from ShareGPT dataset for fairness eval. Don't want to add this capability to existing benchmark framework as there is some effort, but also don't want to pollute main. See this branch
  • Raw results used by notebook is available here
  • See notebook

When you have sometime pls review.

Venkat2811 avatar Jun 11 '25 08:06 Venkat2811

update: rebased with main

Venkat2811 avatar Jun 18 '25 19:06 Venkat2811

apologies for late response. I was busy with some events recently and I will finish review by EOD.

Jeffwan avatar Jun 19 '25 00:06 Jeffwan

/cc @gemini-code-assist

Jeffwan avatar Jun 19 '25 00:06 Jeffwan

Hi @Jeffwan, thanks for the update!

As you dive into the review, here's a quick overview of the key changes in this PR to help you focus:

  • VTC Router Bug Fix: The PR includes a fix for a bug in the vtc-basic router constructor (pkg/plugins/gateway/algorithms/vtc/token_tracker.go) that was discovered during benchmark writing. There's also a new test case (TestTokenTrackerWindowSizeThroughConstructor) to cover this.
  • Gateway Plugin Metrics: New Prometheus metrics are exposed for the gateway plugins, along with a new Grafana dashboard (observability/grafana/AIBrix_Envoy_Gateway_Plugins_Dashboard.json) for visualization. This involves changes in cmd/plugins/main.go, config/gateway/gateway-plugin/gateway-plugin.yaml, and the introduction of pkg/metrics/server.go.
  • Improved E2E Tests: The vtc_routing_test.go has been significantly updated. It's now enabled by default (removed flaky build tags), includes more robust tests for token-based routing and fairness under high utilization, and uses a more structured TestMain for Redis client management.
  • Development Workflow: New Makefile targets (dev-install-in-kind, dev-port-forward, etc.) and a scripts/port-forward.sh script have been added to streamline local development and testing with Kind.

Given the nature of these changes, you might want to pay particular attention to:

  • The correctness of the VTC router fix and the new test's coverage.
  • The Prometheus metric exposure and the Grafana dashboard for clarity and usefulness.
  • The new e2e test cases in vtc_routing_test.go to ensure they accurately reflect the desired routing behavior and cover relevant scenarios.

Let me know if you have any questions as you go through it!

gemini-code-assist[bot] avatar Jun 19 '25 00:06 gemini-code-assist[bot]

overall looks good to me. @Venkat2811 BTW, some of the features in this PR seem generic enough to be extracted as standalone components. Personally, I think it would be cleaner to split this into three separate PRs. If you're okay with that, could you help break it down? Of course, if you feel there's too much overlap between parts, we can still proceed with reviewing and merging it as a single PR.

Here’s a possible breakdown:

  • Gateway plugin extension: metrics server support + dashboard
  • Tooling updates: port-forward support, Makefile changes
  • VTC-related changes

Thanks for review @Jeffwan. Sure, will change this to draft and split into separate PRs and also address comments

Venkat2811 avatar Jun 19 '25 07:06 Venkat2811

split into separate PRs as agreed:

  • https://github.com/vllm-project/aibrix/pull/1210
  • https://github.com/vllm-project/aibrix/pull/1211

Venkat2811 avatar Jun 20 '25 00:06 Venkat2811

ready for review @Jeffwan

Venkat2811 avatar Jun 20 '25 00:06 Venkat2811

@Venkat2811 seems this one have some conflicts with main branch. could you rebase the main and resolve it. The change looks good to me

Jeffwan avatar Jun 20 '25 01:06 Jeffwan

@Jeffwan Done 👍🏽

Venkat2811 avatar Jun 20 '25 12:06 Venkat2811

@Jeffwan is there anything else pending in this PR ? Looking to get this merged.

Venkat2811 avatar Jun 23 '25 05:06 Venkat2811

@Venkat2811 I tried to merge this morning but notice it has conflict with upstream. could you help resolve it? (Looks like my earlier comment didn’t go through) image

Jeffwan avatar Jun 23 '25 06:06 Jeffwan

Got it @Jeffwan, updated now. I had to Merge branch 'main' into vtc-router-e2e-tests bcos of too many commits and rebase conflicts. If squash and merge still shows rebase issues:

  • I'm okay with you doing a merge commit
  • If you don't want to do that, I'll create new PR https://github.com/vllm-project/aibrix/pull/1222 & we can close this

Venkat2811 avatar Jun 23 '25 06:06 Venkat2811

@Venkat2811 I see you already cut https://github.com/vllm-project/aibrix/pull/1222 , we can move to #1222. Once the integrate test pass, I will merge it and we can close this one

Jeffwan avatar Jun 23 '25 07:06 Jeffwan

changes are merged in https://github.com/vllm-project/aibrix/pull/1222

Venkat2811 avatar Jun 23 '25 22:06 Venkat2811