
What creates mask.json?

Open robbie-demuth opened this issue 2 years ago • 12 comments

I've been trying to port a controller to test it using Sieve. I've stumbled across a variety of minor issues, but have thus far managed to overcome them, so I won't talk about them here. Now, however, I'm at an issue that I'm not sure how to resolve. I've been following https://github.com/sieve-project/sieve/blob/main/docs/port.md and am currently at

First run Sieve learning stage

python3 sieve.py -p your-controller -t your-test-case-name -s learn -m learn-twice

Sieve appears to properly deploy my controller and execute my test case. Once the test case finishes executing, however, Sieve fails because it cannot find a mask.json file:

wait for final grace period 50 seconds
Generating controller family list...
Generating state update summary...
Generating end state...
Sanity checking the sieve log log/appian-operator/recreate/learn/learn-once/learn.yaml/sieve-server.log...
[FAIL] cannot find mask.json

The error appears to stem from here. What creates this file? Is it possible that an earlier step created it but I've since deleted it? Is the file meant to be created manually? I see no reference to it in the docs, and based on its appearance in other examples, it doesn't look like the file is created by hand.

Thanks!

PS: I would share the code, but the operator I'm porting is (currently) closed source. Please let me know what information might be useful and I'll try to share if I can!

robbie-demuth avatar Jun 20 '22 19:06 robbie-demuth

Hi @robbie-demuth , thank you for your interest in Sieve!

For the mask.json issue, could you check whether Sieve continues running after reporting this error and generates a mask.json in examples/your-controller-name/oracle/your-test-name?

The learn-twice option means Sieve runs the same test workload twice and generates the mask.json only in its second run. Typically the error only happens in the first run and is benign. You can ignore it if Sieve generates the mask.json at the end of the second run. The error message is indeed misleading and I will fix it. Please let me know if you still cannot find mask.json at the end of the learn-twice run.
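If you want a quick way to verify, here is a short sketch (assuming the examples/ layout mentioned above; `find_mask_files` is a hypothetical helper, not part of Sieve):

```python
from pathlib import Path

def find_mask_files(root):
    """Recursively collect every mask.json under the given directory."""
    return sorted(str(p) for p in Path(root).rglob("mask.json"))

# After the second learn run finishes, something like
# find_mask_files("examples/your-controller-name") should list the
# oracle/your-test-name/mask.json file.
```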

marshtompsxd avatar Jun 20 '22 21:06 marshtompsxd

Hey, @marshtompsxd 👋 thanks for the quick response! I'll give that a shot today. When I saw the "failure" and saw Sieve delete and recreate the kind cluster, I figured it was just straight up retrying. Given how long it takes to run the tests on my machine and how resource intensive they are, I stopped Sieve instead of letting it continue. I'll report back on how things go 👍

robbie-demuth avatar Jun 21 '22 13:06 robbie-demuth

It took a while, but I was finally able to run

python3 sieve.py -p your-controller -t your-test-case-name -s learn -m learn-twice

I'm now on to trying to run the test plans and had a few comments / questions:

Sieve will generate the test plans for intermediate-states, unobserved-states and stale-state testing patterns in log/your-controller/your-test-case-name/learn/learn-twice/{intermediate-state, unobserved-states, stale-state}

The test plans are actually generated in log/your-controller/your-test-case-name/learn/learn-twice/learn.yaml/{intermediate-state, unobserved-states, stale-state} (note the learn.yaml)

If you want to run one of the test plans:

python3 sieve.py -p your-controller -t your-test-case-name -s test -m intermediate-state -c path-to-the-test-plan

I don't think -m intermediate-state is valid. Running the above command resulted in

Usage: python3 sieve.py [options]

sieve.py: error: invalid test mode option: intermediate-state

From the code, it looks like -m should be set to test

    parser.add_option(
        "-m",
        "--mode",
        dest="mode",
        help="MODE: vanilla, test, learn-once, learn-twice",
        metavar="MODE",
    )

Sieve generated hundreds of test plans (which I suppose is a good thing). Where do I go from here? Should I run all of them to test for bugs? Given how long it takes my test case to execute, I don't know if this is practical

Thanks for the assistance!

robbie-demuth avatar Jun 24 '22 15:06 robbie-demuth

Hi @robbie-demuth , thanks for your valuable feedback!

Regarding 1 and 2, I am sorry that the documentation is not updated, and thank you for finding the bugs in the documentation! I will update the documentation this week.

Regarding 3, yes, generating many test plans is a good sign. Usually you can start by running all the test plans. Sieve supports running all the test plans in one batch by

python3 sieve.py -p your-controller -t your-test-case-name -s test -c the-folder-contains-your-test-plans --batch 

All the test results will appear in sieve_test_results folder as json files. I just pushed a script for selecting the test results that might indicate a bug and you can run it by python3 report_bugs.py (please pull). You can refer to the test results pointed to by this script. Each test result file should have sufficient information for debugging including the full path to the test plan and the detected inconsistencies. You can also rerun a particular test plan to reproduce a bug by

python3 sieve.py -p your-controller -t your-test-case-name -s test -c your-test-plan
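For a rough idea of what the selection step looks like, here is a simplified sketch (not report_bugs.py itself; it assumes the nested result layout and the injection_completed / workload_completed / detected_errors fields that appear in the test result files):

```python
import glob
import json
import os

def suspicious_results(results_dir):
    """Collect (test-plan path, error count) for runs that completed but
    still reported inconsistencies."""
    suspects = []
    for path in glob.glob(os.path.join(results_dir, "*.json")):
        with open(path) as f:
            data = json.load(f)
        # Walk controller -> workload -> "test" -> test-plan-path -> result.
        for controller in data.values():
            for workload in controller.values():
                for plan, result in workload.get("test", {}).items():
                    completed = (result.get("injection_completed")
                                 and result.get("workload_completed"))
                    if completed and result.get("detected_errors"):
                        suspects.append((plan, len(result["detected_errors"])))
    return suspects
```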

marshtompsxd avatar Jun 24 '22 23:06 marshtompsxd

Given how long it takes my test case to execute, I don't know if this is practical

This could indeed be a concern when you have many test plans. The estimated testing time is (number of test plans) * (time to run the test case once). We prepared a script, test_script/runtest.sh, to parallelize test runs across multiple machines. If you have more than one machine to run the tests, you can consider the following:

cd test_script
bash runtest.sh your-controller-name

All the test results will be gathered in a test_script/test-summary-xxx.json file. Note that the script was built for our internal testing and there may be some usability issues. It assumes the following:

  1. You have a hosts file listing all the machines that will run the tests, like the following (the first line, a lone :, means the local host):
:
vm2
vm3
  2. You have a remotehosts file that is the same as hosts except that it omits the local host:
vm2
vm3
  3. All the machines have Sieve downloaded at /home/ubuntu/sieve -- you can of course modify this path in runtest.sh.
  4. All the machines should be configured with the proper environment to run Sieve. You can consider using our Ansible scripts in deploy_script to configure your machines, but note that these scripts will install software such as docker and kind.
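Parsing the hosts file convention above can be sketched like this (a hypothetical helper mirroring the rule that a lone : stands for the local host):

```python
def parse_hosts(text):
    """Split a hosts file into (use_localhost, remote_hosts).

    A line containing only ':' stands for the local host; every other
    non-empty line names a remote machine.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return ":" in lines, [line for line in lines if line != ":"]
```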

marshtompsxd avatar Jun 24 '22 23:06 marshtompsxd

Thank you again for using Sieve to test your controller! And I apologize for all the usability issues you have encountered. I will update the documentation and include the steps to run and diagnose tests this week.

marshtompsxd avatar Jun 24 '22 23:06 marshtompsxd

Hey, @marshtompsxd,

Good news - I was able to run a single test plan on a remote machine (not by running runtest.sh, but by running python3 sieve.py -p your-controller -t your-test-case-name -s test -m test -c path-to-the-test-plan on the machine itself)! I'm now running all of the test plans in one batch using the command you provided and will check back in a day or two

In the meantime, I was hoping to get some feedback on the test results from the single test plan I ran:

{
    "appian-operator": {
        "recreate": {
            "test": {
                "log/appian-operator/recreate/learn/learn-twice/learn.yaml/intermediate-state/intermediate-state-test-plan-1.yaml": {
                    "duration": 1387.2117638587952,
                    "injection_completed": true,
                    "workload_completed": true,
                    "number_errors": 51,
                    "detected_errors": [
                        "End state inconsistency - fewer objects than reference: controllerrevision/default/mysql-767bc79d77 is seen after reference run, but not seen after testing run",
                        "End state inconsistency - more objects than reference: 2 pod object(s) seen after reference run ['pod/default/appian-operator-controllers-6cb49b8499-vzjzc', 'pod/default/mysql-0'] but 17 pod object(s) seen after testing run ['pod/default/appian-k8s-service-manager-analytics00-shutdown-22k6j', 'pod/default/appian-k8s-service-manager-analytics01-shutdown-gp66z', 'pod/default/appian-k8s-service-manager-analytics02-shutdown-t5cc5', 'pod/default/appian-k8s-service-manager-channels-shutdown-gwjtj', 'pod/default/appian-k8s-service-manager-content-shutdown-vpf9n', 'pod/default/appian-k8s-service-manager-download-stats-shutdown-lj8l4', 'pod/default/appian-k8s-service-manager-execution00-shutdown-t829m', 'pod/default/appian-k8s-service-manager-execution01-shutdown-skqv5', 'pod/default/appian-k8s-service-manager-execution02-shutdown-9ndwn', 'pod/default/appian-k8s-service-manager-forums-shutdown-4mgvz', 'pod/default/appian-k8s-service-manager-groups-shutdown-kzz48', 'pod/default/appian-k8s-service-manager-notifications-email-shutdown-v4vdw', 'pod/default/appian-k8s-service-manager-notifications-shutdown-2thnz', 'pod/default/appian-k8s-service-manager-portal-shutdown-lt8fj', 'pod/default/appian-k8s-service-manager-process-design-shutdown-pvg2b', 'pod/default/appian-operator-controllers-574b5d5d5-xd47s', 'pod/default/mysql-0']",
                        "End state inconsistency - more objects than reference: 3 endpointslice object(s) seen after reference run ['endpointslice/default/appian-operator-controllers-metrics-dbb2r', 'endpointslice/default/mysql-headless-bnjc8', 'endpointslice/default/mysql-pfnxb'] but 27 endpointslice object(s) seen after testing run ['endpointslice/default/appian-k8s-8rczk', 'endpointslice/default/appian-k8s-data-server-headless-5d4fl', 'endpointslice/default/appian-k8s-kafka-bootstrap-crmdc', 'endpointslice/default/appian-k8s-kafka-headless-bvgmf', 'endpointslice/default/appian-k8s-search-server-headless-r759z', 'endpointslice/default/appian-k8s-service-manager-analytics00-headless-pq2sw', 'endpointslice/default/appian-k8s-service-manager-analytics01-headless-hrnbd', 'endpointslice/default/appian-k8s-service-manager-analytics02-headless-vvc44', 'endpointslice/default/appian-k8s-service-manager-channels-headless-rb2rw', 'endpointslice/default/appian-k8s-service-manager-content-headless-fqw7s', 'endpointslice/default/appian-k8s-service-manager-download-stats-headless-mv9dl', 'endpointslice/default/appian-k8s-service-manager-execution00-headless-2spt2', 'endpointslice/default/appian-k8s-service-manager-execution01-headless-98dw6', 'endpointslice/default/appian-k8s-service-manager-execution02-headless-hsrtc', 'endpointslice/default/appian-k8s-service-manager-forums-headless-ngfkk', 'endpointslice/default/appian-k8s-service-manager-groups-headless-w25vl', 'endpointslice/default/appian-k8s-service-manager-notifications-email-headless-wl8xh', 'endpointslice/default/appian-k8s-service-manager-notifications-headless-jk5tk', 'endpointslice/default/appian-k8s-service-manager-portal-headless-npdxp', 'endpointslice/default/appian-k8s-service-manager-process-design-headless-qdk92', 'endpointslice/default/appian-k8s-webapp-ggmw7', 'endpointslice/default/appian-k8s-webapp-headless-vbstd', 'endpointslice/default/appian-k8s-zookeeper-9tl5d', 'endpointslice/default/appian-k8s-zookeeper-headless-rfdqd', 'endpointslice/default/appian-operator-controllers-metrics-9brbm', 'endpointslice/default/mysql-headless-zlpdd', 'endpointslice/default/mysql-vnpft']",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-data-server-6f5dc69d4b is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-kafka-7d67884b9c is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-search-server-65769d754 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-analytics00-7896bbbf7 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-analytics01-5cccc964d9 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-analytics02-66fdb789c7 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-channels-797ffb84c7 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-content-b764bb74c is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-download-stats-5bd55c64d9 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-execution00-697ff49b85 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-execution01-688844c59c is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-execution02-6cdb99d857 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-forums-668988c4f7 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-groups-56db4c89c is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-notifications-7574b7cb6f is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-notifications-email-6d4dfb9595 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-portal-bc587f5d5 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-service-manager-process-design-79cdb78697 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-webapp-7f9df69d95 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/appian-k8s-zookeeper-84ddc5c99 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: controllerrevision/default/mysql-5fcb56bf98 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: job/default/appian-k8s-service-manager-notifications-email-shutdown is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: statefulset/default/appian-k8s-service-manager-analytics00 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: statefulset/default/appian-k8s-service-manager-analytics01 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: statefulset/default/appian-k8s-service-manager-analytics02 is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: statefulset/default/appian-k8s-service-manager-content is not seen after reference run, but seen after testing run",
                        "End state inconsistency - more objects than reference: statefulset/default/appian-k8s-service-manager-forums is not seen after reference run, but seen after testing run",
                        "End state inconsistency - object field has a different value: configmap/default/mysql[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: endpoints/default/mysql-headless[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: endpoints/default/mysql[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: lease/default/3c0233a2.k8s.appian.com[\"spec\"][\"leaseTransitions\"] is 0 after reference run, but 1 after testing run",
                        "End state inconsistency - object field has a different value: pod/default/mysql-0[\"metadata\"][\"labels\"][\"controller-revision-hash\"] is mysql-767bc79d77 after reference run, but mysql-5fcb56bf98 after testing run",
                        "End state inconsistency - object field has a different value: pod/default/mysql-0[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: secret/default/mysql[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: service/default/mysql-headless[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: service/default/mysql[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: serviceaccount/default/mysql[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: statefulset/default/mysql[\"metadata\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: statefulset/default/mysql[\"spec\"][\"template\"][\"annotations\"][\"checksum/configuration\"] is a0676de8d34dab531ec9863de864962cf302d50067051139eaaba5b53861e32b after reference run, but e763e7839b4cb08448cabbc571bc740842a54fc30d3c9a33eb7652246e44ee8a after testing run",
                        "End state inconsistency - object field has a different value: statefulset/default/mysql[\"spec\"][\"template\"][\"labels\"][\"helm.sh/chart\"] is mysql-9.1.7 after reference run, but mysql-9.1.8 after testing run",
                        "End state inconsistency - object field has a different value: statefulset/default/mysql[\"status\"][\"currentRevision\"] is mysql-767bc79d77 after reference run, but mysql-5fcb56bf98 after testing run",
                        "End state inconsistency - object field has a different value: statefulset/default/mysql[\"status\"][\"updateRevision\"] is mysql-767bc79d77 after reference run, but mysql-5fcb56bf98 after testing run",
                        "State-update summaries inconsistency: job/default/appian-k8s-service-manager-notifications-email-shutdown DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
                        "State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-analytics00 DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
                        "State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-analytics01 DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
                        "State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-analytics02 DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
                        "State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-content DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
                        "State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-forums DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run"
                    ],
                    "no_exception": true,
                    "exception_message": "",
                    "test_config_content": "workload: recreate\nactions:\n- actionType: restartController\n  controllerLabel: appian-operator\n  trigger:\n    definitions:\n    - triggerName: trigger1\n      condition:\n        conditionType: onObjectUpdate\n        resourceKey: appian/default/appian-k8s\n        prevStateDiff: '{\"metadata\": {}}'\n        curStateDiff: '{\"metadata\": {\"finalizers\": [\"crd.k8s.appian.com/ordered-shutdown\"]}}'\n        occurrence: 1\n      observationPoint:\n        when: afterControllerWrite\n        by: gitlab.appian-stratus.com/appian/prod/appian-operator/controllers.(*AppianReconciler)\n    expression: trigger1\n",
                    "host": "sieve-01.appiancorp.com"
                }
            }
        }
    }
}

First, which is the reference run and which is the testing run? I'm assuming the reference run is what generated the test plans?

Second, it looks like there are a lot of false positives. A lot of the "End state inconsistency" results look like they're due to randomness in controller revision and pod names. It also looks like Sieve is picking up on changes to some resources not managed by our operator (the MySQL stuff). What do the "State-update summaries inconsistency" results indicate?

Thanks for all of your help on this!

robbie-demuth avatar Jun 27 '22 18:06 robbie-demuth

Hi @robbie-demuth , glad to hear that you managed to run the test plans!

First, which is the reference run and which is the testing run? I'm assuming the reference run is what generated the test plans?

The reference run is the learning phase you ran before (by python3 sieve.py -p your-controller -t your-test-case-name -s learn -m learn-twice). Sieve runs the same test case in both the reference run and the test run; the difference is that Sieve does not inject any faults into the reference run. And yes, the test plans are generated from the reference run.

A lot of the "End state inconsistency" results look like they're due to randomness in controller revision and pod names. It also looks like Sieve is picking up on changes to some resources not managed by our operator (the MySQL stuff).

Yes, there seem to be many false alarms caused by random object names. The reason is that Sieve compares the end states (e.g., all the objects and their fields) of the two runs and generates an alarm for each inconsistency. If an object (like a controllerrevision) gets a random name in every run, Sieve cannot tell whether two objects with different names in the two runs are logically the same object, and may generate false alarms.

For now, the best way to suppress these false alarms is to manually mask the objects you do not want to check, or the objects that are hard to check due to randomness. For example, you can mask all the controllerrevision objects as we did for the MongoDB operator: https://github.com/sieve-project/sieve/blob/684644d03a91863956fcd283d132beb1c84eb4a6/examples/mongodb-operator/config.json#L61 so that Sieve will not check them for inconsistencies. You can also make the mask more fine-grained, like controllerrevision/default/appian-k8s-service-manager-notifications-*.
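If you ever post-process results yourself, the effect of such wildcard masks can be sketched with fnmatch-style matching (a hypothetical filter, not Sieve's own masking code; the patterns are illustrative):

```python
from fnmatch import fnmatchcase

# Illustrative patterns only -- adjust to the objects you want to skip.
MASKS = [
    "controllerrevision/default/*",
    "pod/default/appian-k8s-service-manager-notifications-*",
]

def is_masked(resource_key, patterns=MASKS):
    """True if an object key such as 'pod/default/mysql-0' matches a mask."""
    return any(fnmatchcase(resource_key, p) for p in patterns)

def unmasked_alarms(detected_errors, patterns=MASKS):
    """Drop alarm strings that mention a masked object key."""
    return [err for err in detected_errors
            if not any(is_masked(token, patterns) for token in err.split())]
```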

I also noticed that there seem to be false alarms caused by some random fields like mysql["status"]["currentRevision"]. You can also mask certain fields of an object like https://github.com/sieve-project/sieve/blob/684644d03a91863956fcd283d132beb1c84eb4a6/examples/mongodb-operator/config.json#L66-L69, where we masked ["spec"]["organization"] for the object.

What do the "State-update summaries inconsistency" results indicate?

"State-update summaries" is another checker that compares the number of object creation/deletion events between the two runs. It complements the end-state checker because sometimes the controller goes wrong in the middle of a test run but eventually gets to the correct state.
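Conceptually, this checker can be pictured as comparing per-object event counts between the two runs (a simplified sketch, not Sieve's actual implementation):

```python
from collections import Counter

def summary_inconsistencies(reference_events, testing_events):
    """Compare per-object event counts between two runs.

    Each argument is a list of (resource_key, event_type) pairs, e.g.
    [("statefulset/default/foo", "DELETED")]. Returns one alarm string
    per (object, event type) whose counts differ.
    """
    ref = Counter(reference_events)
    test = Counter(testing_events)
    alarms = []
    for key in sorted(ref.keys() | test.keys()):
        if ref[key] != test[key]:
            resource, event = key
            alarms.append(
                f"{resource} {event} inconsistency: {ref[key]} event(s) seen "
                f"during reference run, but {test[key]} seen during testing run"
            )
    return alarms
```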

marshtompsxd avatar Jun 27 '22 21:06 marshtompsxd

I also noticed that there seem to be some interesting warnings like

"State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-analytics00 DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
"State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-analytics01 DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
"State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-analytics02 DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
"State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-content DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run",
"State-update summaries inconsistency: statefulset/default/appian-k8s-service-manager-forums DELETED inconsistency: 1 event(s) seen during reference run, but 0 seen during testing run"

I was assuming your test case is supposed to make the controller delete the statefulset (as it does in the reference run). If so, it might be interesting to see why the statefulset is never deleted in the test run.

marshtompsxd avatar Jun 27 '22 21:06 marshtompsxd

Hey, @marshtompsxd !

I'm circling back to this after being OOO for a while. I've got all of the results back now, but I didn't apply a mask b/c I didn't want to accidentally mask what may be bugs prematurely. Is there any way to apply a mask post-fact? I'm writing a script akin to report_bugs.py that suppresses false positives, but it's just scanning the detected errors. I tried looking at the Python code that applies masks during learning (?) and testing and got a bit lost.

Anyway, I think the biggest source of false positives is termination. My test case creates a custom resource, waits for the corresponding app to start, and deletes it. The custom resource has a finalizer so that the controller can do a graceful shutdown. The controller gracefully scales down the stateful sets in an app-specific order, but does not delete them (or any other secondary resource, for that matter). Secondary resources are garbage collected by Kubernetes once the custom resource has been deleted b/c of the owner references the controller sets up. My guess is that differences in how long the GC took between the reference run and the testing runs are causing a lot of the false positives. Any idea how to account for that (hopefully w/out re-learning / re-testing)? I'm having trouble determining if there are any real bugs based on the sheer number of what I think are false positives.

Thanks!

robbie-demuth avatar Jul 13 '22 20:07 robbie-demuth

Hi @robbie-demuth Thanks for letting me know about your test results.

Is there any way to apply a mask post-fact?

If you still keep your test raw data, that is, the files in log/, you can pull and run

python3 sieve.py -p your-controller -c test-plan-folder -b --phase=check

And in the sieve_test_results folder you will have new test result JSON files updated by your new masks.

Secondary resources are garbage collected by Kubenetes once the custom resource has been deleted b/c of the owner references the controller sets up. My guess is that differences in how long the GC took between the reference run and the testing runs are causing a lot of the false positives. Any idea how to account for that (hopefully w/out re-learning / re-testing)?

From your description, the difference in whether the GC has deleted the secondary resources is very likely the reason for the false alarms. If in one run the GC runs a bit faster and deletes all the resources, while in the other run it is a bit slower and does not delete anything, Sieve will detect many inconsistencies and report them as alarms (though they might not be bugs).

This is actually a common problem when running end-to-end tests, since one kubectl command might trigger many controller actions in the background (creation/update/deletion) and it is hard to estimate how long it takes for all these actions to finish and for the cluster to become stable again. A simple but effective approach we have been using is to wait in the test case for specific resources to be created/deleted. For example, we have APIs to wait for certain conditions (terminated, running) of certain resources (pod, statefulset) in the test case, like this: https://github.com/sieve-project/sieve/blob/684644d03a91863956fcd283d132beb1c84eb4a6/examples/rabbitmq-operator/test/test.py#L16.
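The wait-for pattern can be sketched as a generic polling helper (assumed names; Sieve's own test APIs differ):

```python
import time

def wait_for(condition, timeout=600, interval=5):
    """Poll condition() until it returns True or `timeout` seconds elapse."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout} seconds")

# e.g. block until a StatefulSet is gone before ending the test case:
# wait_for(lambda: statefulset_is_gone("appian-k8s-service-manager-content"))
# (statefulset_is_gone is a hypothetical check, e.g. via kubectl or the API.)
```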

Unfortunately, rerunning the learning/testing is necessary if you modify the test case to add more wait-for-xxx steps, because Sieve needs the new trace (where secondary resources are correctly deleted by the GC if no bug is triggered) generated by the modified test case to detect bugs.

marshtompsxd avatar Jul 15 '22 03:07 marshtompsxd

The simple but effective approach we were using is to wait for certain resource getting created/deleted in the test case. For example we have some APIs to wait for certain conditions (terminated, running) of certain resources (pod, statefulset) in the test case like this:

@marshtompsxd this is probably a best practice we should mention in the documentation.

lalithsuresh avatar Jul 15 '22 03:07 lalithsuresh

Hi @robbie-demuth Please let us know if you encounter more difficulty in using Sieve or any improvement/feature you want to see in Sieve.

marshtompsxd avatar Aug 30 '22 14:08 marshtompsxd

Hey, @marshtompsxd. Will do 👍 I'll go ahead and close this for now

robbie-demuth avatar Aug 30 '22 14:08 robbie-demuth