
Dogfood OpenEBS E2E failures by capturing useful information

Open vharsh opened this issue 3 years ago • 1 comment

Questions

  1. What should the goal of this tool be? Should it just stick to pointing out problem areas, or should it also dump data from those areas, and can that data be trusted at face value?
  2. How much log data should the tool collect, if any? Just enough to narrow things down, or all of it, so that further debugging is done by grep-ing the output in an editor of choice or via some back-and-forth commands?
  3. What should the baseline assumption for this tool be (it's turtles all the way down; which turtle should be this tool's last one)? Is it a good idea to assume that K8s is healthy and managed perfectly by the admin?

Background

  • Right now we have very preliminary support for debugging cStor volumes; it would be good to think about something along the lines of debugging + creating a GitHub issue + dogfooding, etc.
  • Right now the cStor volume debugging just points to places that seem off. It would be good to plan and implement debugging in stages, i.e. help narrow down the search space by pointing out what is right, what isn't, and what may not be (a minimal sketch of such staged checks follows this list):
    • Identify the list of things that need to be checked (is the storage engine replicated? should failing NDM agents affect this volume/pool?)
    • The K8s API server is up and healthy.
    • The K8s kube-system components are up, the kubelet container (for certain setups) is up, and the node heartbeats of the concerned nodes look fine (are they alive and kicking, do they report any X-Pressure conditions?).
    • Networking isn't down (important for replicated storage engines).
    • The relevant OpenEBS components are up (as identified in step 1).
  • There are some limitations to the tool; it might be hard to figure out (at first) whether the application is failing because of storage or vice versa.
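
For illustration, here is a minimal sketch (not the current openebsctl implementation) of what such staged checks could look like with client-go; the kubeconfig loading, the exact check order, and the printed output are all assumptions.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig is picked up from the default location, the way
	// kubectl does it; openebsctl's real client wiring may differ.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Stage 1: is the API server reachable and ready?
	body, err := client.Discovery().RESTClient().Get().AbsPath("/readyz").DoRaw(ctx)
	if err != nil {
		fmt.Println("API server /readyz check failed:", err)
		return // no point in checking anything else if the API server is down
	}
	fmt.Println("API server /readyz:", string(body))

	// Stage 2: are kube-system components running?
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		fmt.Println("listing kube-system pods failed:", err)
		return
	}
	for _, p := range pods.Items {
		if p.Status.Phase != "Running" && p.Status.Phase != "Succeeded" {
			fmt.Printf("kube-system pod %s is %s\n", p.Name, p.Status.Phase)
		}
	}

	// Stage 3: node heartbeat and pressure conditions
	// (NotReady, MemoryPressure, DiskPressure, PIDPressure).
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		fmt.Println("listing nodes failed:", err)
		return
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			ready := c.Type == "Ready"
			if (ready && c.Status != "True") || (!ready && c.Status == "True") {
				fmt.Printf("node %s: %s=%s (%s)\n", n.Name, c.Type, c.Status, c.Reason)
			}
		}
	}
}
```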

Goals

  • OpenEBSctl can already show, in a single shot, some of the information we generally ask our community users for while interacting with them, and we plan to help them automatically create a GitHub issue via #39.
  • It might be a good ask to use the same tool to collect useful information before cluster destruction, which is likely what happens when an E2E test fails. It could be useful as a replacement for a bunch of kubectl and shell commands (a rough log-collection sketch follows this list).
  • To-be-decided-and-updated
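
As a rough idea of what replacing ad-hoc kubectl log collection could look like, the sketch below pulls the last few hundred log lines from every pod in a namespace via client-go; the "openebs" namespace, the 500-line tail, and the CollectLogs helper name are illustrative assumptions, not existing openebsctl code.

```go
package collect

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CollectLogs is a hypothetical helper: it gathers recent logs from every pod
// in the given namespace (e.g. "openebs") so they can be attached to a GitHub
// issue or to an E2E failure report before the cluster is torn down.
func CollectLogs(ctx context.Context, client kubernetes.Interface, ns string) (map[string]string, error) {
	tail := int64(500) // assumption: 500 lines is "just enough"
	out := map[string]string{}

	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, p := range pods.Items {
		req := client.CoreV1().Pods(ns).GetLogs(p.Name, &corev1.PodLogOptions{TailLines: &tail})
		data, err := req.DoRaw(ctx)
		if err != nil {
			// Record the failure instead of aborting the whole collection.
			out[p.Name] = "log collection failed: " + err.Error()
			continue
		}
		out[p.Name] = string(data)
	}
	return out, nil
}
```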

Prerequisite issues for this task:

  1. #143
  2. #39

vharsh avatar Feb 02 '22 12:02 vharsh

I'll have a chat with the E2E team about how useful this can become and what further enhancements could help it get there.

vharsh avatar Feb 28 '22 03:02 vharsh