Distributed Execution Proposal

Open na-- opened this issue 2 years ago • 1 comments

I know that https://github.com/grafana/k6/issues/140 already exists! :sweat_smile: However, that issue has a lot of history, unrelated discussion, and comments, so at this point it isn't very suitable for actually tracking the work that has been done and that remains to be done on the topic...

https://github.com/grafana/k6/issues/140 is also a generic tracking issue for the "distributed execution" capability and will remain open until that is natively supported by k6, however that happens. I, on the other hand, needed a central place where I can link up and explain all of the parts of my specific proposal for how to get there. More specifically, this issue is for tracking the full implementation and delivery of a stable version of the distributed execution PoC that originally started as https://github.com/grafana/k6/pull/2438 and was later further developed in https://github.com/grafana/k6/pull/2816. That covers everything from the design document that explains it, to the PRs that implement parts of it, to the missing parts that haven't been implemented yet. This issue is for all of that, and I'll just link to it in a https://github.com/grafana/k6/issues/140 comment :sweat_smile:

I will also be taking a long vacation/sabbatical for the next few months and then gradually moving to another team. Because of that (and somewhat prompted by @ragnarlonn's comments in #140 :sweat_smile:), I thought it made sense to try and write down my thoughts and ideas on the topic. It's very unlikely that I will be able to finish the distributed execution work myself, so I've tried to make what exists as code and as ideas in my head as easy to adopt and build upon as is practical before I go away for a while.

Since I won't be around (I might not have been as diligent otherwise... :sweat_smile:), I've written a design document and refactored the original proofs of concept into backwards-compatible (and so, hopefully merge-able) PRs :crossed_fingers:

Here is a (hopefully mostly complete) list of tasks that remain:

### Planning
- [x] [Refactor the original PoC](https://github.com/grafana/k6/pull/2816#issuecomment-1635466149) so that HDR histograms are the last commit and distributed execution can be split off into multiple atomic and self-sufficient commits/PRs that are completely backwards compatible.
- [x] Merge minor prerequisite refactoring commits (https://github.com/grafana/k6/pull/3191) and update [xk6-output-prometheus-remote](https://github.com/grafana/xk6-output-prometheus-remote) to resolve conflicts in its usage of internal k6 APIs (https://github.com/grafana/xk6-output-prometheus-remote/pull/133, https://github.com/grafana/k6/pull/3210).
- [x] Agree on and adopt (i.e. merge) the design document https://github.com/grafana/k6/pull/3217, or at least do so provisionally if no major issues are found with it and nothing better is suggested by someone else. All of this should be considered experimental until stated otherwise, so it could be discarded if a better approach is found later, even if some of the PRs were already merged.

### Experimental version
- [x] Merge https://github.com/grafana/k6/pull/3204.
- [ ] Add error handling via a dedicated pull request to #3205
- [ ] Add a good enough level of test coverage via a dedicated pull request to #3205
- [ ] Merge https://github.com/grafana/k6/pull/3205.
- [ ] Add support for mutual gRPC authentication between `k6 agent` and `k6 coordinator`

### Integrate with other k6 products
- [ ] Work on transitioning [k6-operator](https://github.com/grafana/k6-operator/)
- [ ] Use the experimental version on [Grafana Cloud k6](https://grafana.com/products/cloud/k6/) for the new distributed execution

### Local metrics
- [ ] Support Cloud metrics output (aka `k6 run -o cloud script.js`). It probably requires implementing #3282.
- [ ] Implement HDR/sparse histogram support (https://github.com/grafana/k6/issues/763). https://github.com/grafana/k6/pull/2816 has a PoC (moved to the last commit), but [more research is definitely needed](https://github.com/grafana/k6/issues/763#issuecomment-1059090120), since I picked the library I used without too much scrutiny. Alternatively, implement the OpenTelemetry output if it is confirmed as a solution.
- [ ] Support thresholds and end-to-end test result refining (https://github.com/grafana/k6/pull/3213)

### UX refinement
- [ ] Add more flexibility beyond the `--instance-count` parameter, to specify how a test is segmented between instances and how each instance is matched to each execution segment. Hopefully with some thoughts on how that can work together with test suites (https://github.com/grafana/k6/issues/1342, also in the design doc) down the line, or at least not interfere.
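
To make the segmentation in the item above a bit more concrete, here is a minimal Go sketch of the simplest possible scheme: splitting a test into equal execution segments, one per instance. It is an illustration only and not code from the PoC. The `equalSegments` helper is hypothetical, and the assumption that an instance count maps to an equal split is mine; the design document is the authority on the actual behavior. The output uses the fraction syntax of the existing `--execution-segment` and `--execution-segment-sequence` options of `k6 run`.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// equalSegments is a hypothetical helper (not part of k6). It splits the whole
// test (the range 0 to 1) into n equal execution segments and returns them in
// "from:to" form, together with the comma-separated segment sequence.
// For n = 4 the segments are "0:1/4", "1/4:1/2", "1/2:3/4" and "3/4:1".
func equalSegments(n int64) ([]string, string, error) {
	if n < 1 {
		return nil, "", fmt.Errorf("instance count must be at least 1, got %d", n)
	}

	// Segment boundaries as exact, reduced fractions: 0, 1/n, 2/n, ..., 1.
	bounds := make([]string, 0, n+1)
	for i := int64(0); i <= n; i++ {
		bounds = append(bounds, big.NewRat(i, n).RatString())
	}

	// Each instance gets the range between two neighbouring boundaries.
	segments := make([]string, 0, n)
	for i := int64(0); i < n; i++ {
		segments = append(segments, bounds[i]+":"+bounds[i+1])
	}
	return segments, strings.Join(bounds, ","), nil
}

func main() {
	segments, sequence, err := equalSegments(4)
	if err != nil {
		panic(err)
	}
	// Print the equivalent manual invocation for every instance.
	for i, seg := range segments {
		fmt.Printf("instance %d: k6 run --execution-segment %q --execution-segment-sequence %q script.js\n",
			i+1, seg, sequence)
	}
}
```

Anything more flexible than this (unequal segments, or an explicit mapping of instances to segments) would replace the equal split in `equalSegments` with user-provided boundaries, which is the part this task is about.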

I've written these tasks in the order I think it makes sense to do them to get the end result with the least amount of effort and time. Though a lot of them can also be done out of order (e.g. HDR histograms can be done first, as was the case originally in the PoC, or tests can be added before some existing PR is merged :blush: :sweat_smile:).

This whole issue and items in the list above can be reordered, removed, added to and checked off at the discretion of whoever picks up this work. And this whole issue can be closed if the approach is considered nonviable and a better alternative exists, without affecting https://github.com/grafana/k6/issues/140.

Jul 20 '23 10:07 na--

Hi, we have a similar set of requirements: we want to spin up X machines to hit a single server and try to saturate its network. Sadly, because the network interfaces of the target machine are capable of more than 20 Gbps, we aren't able to saturate the server/network using only 1 client machine. All I want is to be able to run 4 EC2 instances hitting 1 target server and aggregate the reports from the k6 agents in one single Grafana dashboard/HTML report. Are there any plans to take this proposal live without Kubernetes (I know about k6-operator already)?

Jul 18 '24 06:07 dhairav