daniel-salib
@bbrowning thanks for catching that - yeah, I saw function calls were using it so I carried it over, but I realize we're not pairing the MCP calls to any output...
Thanks for the review @alecsolder! I was able to remove the extra checks and simplify the logic. I also updated the unit tests and caught an issue with multi-turn MCP requests...
Will break the current PR into smaller chunks and open new PRs for them.
Thanks for the reviews :) Made another pass taking all the feedback into consideration
Thanks for the review @youngkent! Resolved all the comments and added a unit test for the /load route. I also attached the benchmark results showing the latency comparison before and after...
@youngkent I ran benchmark_serving.py twice, with and without the middleware, and added the results to the PR description. Is it safe to assume the discrepancy between the runs is due to random +/-...
@youngkent updated the PR description with the benchmark comparison across 20,000 requests per group
@youngkent I was previously using the random dataset when benchmarking, and thought that might have affected the variance. I updated the description with the results after I switched...
Thanks for the review @robertgshaw2-redhat! I ran the performance test on 2 x H100 GPUs.
@youngkent took a different approach that should be much better performance-wise. Updated the description to include the latest benchmarks