sieve icon indicating copy to clipboard operation
sieve copied to clipboard

Experience when working on a fix for time-travel bug in xtradb operator

Open laphets opened this issue 3 years ago • 7 comments

When I was working on to fix K8SPXC-725 and K8SPXC-763 , I have met the following two issues:

Firstly, sieve saves a snapshot for all the crd config files for a certain operator, and those config files can be easily divergent from the upstream. For example, when I tried to run my fixed xtradb operator again for the same workload, since my fix is checkouted from the main branch, which is the latest version of the xtradb operator, but the crd config files inside sieve is outdated for a previous version, and I need to spend some efforts to make those crd file stored in sieve sync with the upstream so that the xtradb operators can be successfully set up with the matched crd configs.

And this issues will also occur if developers want to replay bugs for their own version operators.


Secondly, also for xtradb, they have some tricky logic to specify an init image (image for some init container) for the xtradbcluster. Basically, if the version specified for xtradbcluster is same as the operator, the init image will be same as the operator image, e.g. xxx/xtradb-operator:time-travel

However, if the version for xtradbcluster (e.g. 1.7) is different from the operator, the init image will be xxx/xtradb-operator:1.7

And it is obvious that we do not build such image as xxx/xtradb-operator:1.7.

In that case, we also need to update all the workload’s config files (which specifies the version of xtradbcluster) to catch up with the version of the operator, so that the init image will be assigned same as the operator, otherwise, we may get some image pulling 404 error.

laphets avatar Jun 18 '21 17:06 laphets

Firstly, sieve saves a snapshot for all the crd config files for a certain operator

Yes. Basically you can only find bugs in a specific version, while the upstream is constantly evolving.

And this issues will also occur if developers want to replay bugs for their own version operators.

I think we should just tell them the version we find the bug.

It is possible that the new version does not even have the bug.

Basically, we don't have a fully automated way to test a given operator. If that's the case, then the problem will remain.

tianyin avatar Jun 18 '21 17:06 tianyin

In that case, we also need to update all the workload’s config files

So, I guess the fundamental issue is the manual effort one needs to pay to use Sieve.

Is there any ways to quantify the work?

tianyin avatar Jun 18 '21 17:06 tianyin

@laphets this is why I think the test cases, CRD/manifests and workloads should absolutely /not/ be part of Sieve like it is right now. Developers should be able to use their own version control for these things, and run Sieve against their code bases as is (even if it is not from a commit).

lalithsuresh avatar Jun 18 '21 17:06 lalithsuresh

In that case, we also need to update all the workload’s config files

So, I guess the fundamental issue is the manual effort one needs to pay to use Sieve.

Is there any ways to quantify the work?

Basically, they may need go through steps stated in port.md again for a different version / sha operator they want to test on. And they may also need to update workload related resource config files (we should clarify this) to fit for a different version operator (this is tricky, because those file need only be updated if there is any compatibility issue between different version of operator and workload resource, but this compatibility issue may not always be obvious).

laphets avatar Jun 19 '21 02:06 laphets

@laphets this is why I think the test cases, CRD/manifests and workloads should absolutely /not/ be part of Sieve like it is right now. Developers should be able to use their own version control for these things, and run Sieve against their code bases as is (even if it is not from a commit).

Agree. We currently keep the test cases and configs in the project just to make it easier to reproduce the bugs (especially in the final artifact evaluation phase). The final release of Sieve should only provide several interfaces to reuse/write test workloads. The current test cases and configs can be part of example if we want.

marshtompsxd avatar Jun 19 '21 03:06 marshtompsxd

In that case, we also need to update all the workload’s config files

So, I guess the fundamental issue is the manual effort one needs to pay to use Sieve.

Is there any ways to quantify the work?

To port a new operator, a developer needs to basically 1) add a few entries into controllers.py and 2) provide some necessary configs (like the CRD yaml files) and modify a few lines. To write a new test, a developer needs to add a few lines into controllers.py and workloads.py. We can basically count how many lines should be added/modified during this process, and the number should be very reasonable.

marshtompsxd avatar Jun 19 '21 03:06 marshtompsxd

We currently keep the test cases and configs in the project just to make it easier to reproduce the bugs (especially in the final artifact evaluation phase).

I think the best way to do it is to move them to other repos in this org.

Ideally, this repo is only the sieve tool. And for each operator, it could be a repo in which sieve is a git submodule.

tianyin avatar Jun 19 '21 06:06 tianyin