Add Conformance Test Suite to automated build
I suggest we add running the Conformance Test Suite (CTS) against the in-memory repository as part of our once-daily build.
At a later point in time we can consider if this could be run against other metadata repositories.
I presume this is still a requirement? @mandy-chessell @cmgrote @grahamwallis
Yes please
> At a later point in time we can consider if this could be run against other metadata repositories.
There's an initial version of this already part of our Helm charts for other metadata repositories. Just set the following in the values.yaml of the vdc chart (by default it's set to false), and it should create and configure a CTS instance for each repository being deployed as part of the chart:
```yaml
# Egeria Conformance Test Suite - sets up to run against all Egeria repositories (if enabled=true)
cts:
  enabled: true
```
(I think this is probably our best option, since it will require such an external repository to first exist -- probably not something that will ever be part of our automated Jenkins builds, particularly for proprietary / licensed repositories?)
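For reference, this could also be enabled at install time rather than by editing values.yaml. A minimal sketch, where the release name and chart location are placeholders rather than the real chart coordinates:

```shell
# Equivalent to setting cts.enabled=true in the values.yaml of the vdc chart.
# 'vdc' as the release name and './vdc' as the chart path are assumptions.
helm install vdc ./vdc --set cts.enabled=true
```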
In addition to running the CTS, the results should be shared in some way - for example through some kind of build artifact - so that a consuming organisation could refer back to the CTS results for a shipped release, and so that developers can see CTS results from each build run.
These could then be linked to from the release notes / top-level README.
For 1.2 I plan to execute the CTS & post the results at, or with a link from, the GitHub releases page. Experience learnt will be used to help refine the requirements for 1.3, where automation will be targeted.
CTS is now looking good in 1.2, but for this release the run is done semi-manually (i.e. via a notebook). Automation will follow in a subsequent release.
I am starting to prototype some CI/CD definitions for automated Helm chart deployment. Initially this targets Azure with a very limited subscription as a POC, and with a basic notebook deployment (fewer moving parts), but I will a) move to a fuller subscription and b) add the CTS once some initial proof points are complete.
I'll link the necessary PRs for the CI/CD definitions here. Some of the changes are in base egeria, others are done directly through Azure Pipelines.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
The charts worked well for release testing on graph & inmem (after updates for inmem & OpenShift).
The next step would be to look at running this automatically within a CI/CD workflow, probably triggered off a schedule.
We could take something like the KinD workflow from https://github.com/odpi/egeria-database-connectors/blob/main/.github/workflows/connector-test.yml - obviously there's more to do.
Also referenced recently during the 3.7 release: https://github.com/odpi/egeria/issues/6341
A few simple things we could check (a rough sketch follows below):
- number of tests failed is 0
- number of tests successful is > ???? (current number? just big? could be checked in)
- no exceptions in the audit log
- profile results compare to a baseline (which could be checked in)
Better would be to properly map the tests to test suites, i.e. much more fine-grained, but this is likely substantial development.
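As a very rough illustration of the first check, something like this could gate a build step. The results file name and field name below are purely hypothetical - the real CTS report structure would need to be checked:

```shell
# Hypothetical sketch: sum any failed-assertion counts found in the results file
# and fail the step if the total is non-zero. File name and field name are placeholders.
FAILED=$(jq '[.. | .unsuccessfulAssertions? // empty] | add // 0' cts-results.json)
if [ "$FAILED" -ne 0 ]; then
  echo "CTS reported $FAILED failed assertions"
  exit 1
fi
```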
CTS failures can take a while to investigate, so automation could pick up issues a lot quicker - for example by running daily.
One concern is whether the 7GB agent running KinD would have enough resource to complete the tests
An initial version of this is now being tested at https://github.com/planetf1/cts. Still debugging, but as expected the GitHub Action will (roughly sketched below):
- run 2 parallel jobs - for inmem & graph
- install k8s within a container (KinD)
- set up & install our CTS chart
- wait for results
- capture and post the results as an attachment
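Roughly, each job boils down to steps along these lines. The chart location, values, selector and pod name here are illustrative only, not the actual definitions in the prototype repo:

```shell
# Create a local Kubernetes cluster inside the GitHub-hosted runner
kind create cluster --name cts

# Install the CTS chart for one repository flavour (chart path and values are placeholders)
helm install cts ./egeria-cts --set connector=inmem

# Wait for the platform pod to be ready before polling for results
kubectl wait pods --selector=app.kubernetes.io/name=egeria-cts \
  --for condition=Ready --timeout=15m

# Copy the results out of the pod so the workflow can attach them as an artifact
kubectl cp cts-platform-0:/deployments/data ./cts-results
```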
Caveats:
- Manual trigger only (for testing)
- The 'name' of the job is based on the connector, which is a long name - needs parsing to something simpler
- Personal repo - just to get started quickly (need to discuss with the team where this belongs)
- Need to consider scheduling - daily? Triggers? Dependencies?
- How to report / ensure results are looked at - Slack?
- Need some simple analysis of the results for pass/fail (i.e. # tests, exceptions, etc.) (maybe split out from the test)
- Hardcoded to 3.14-SNAPSHOT (may benefit from a tag for the latest release)
cc: @cmgrote @mandy-chessell @lpalashevski
After 4.5 hours the CTS is still running, even at the minimum size (5 vs 1) and even with the in-memory repository (graph takes longer) -> https://github.com/planetf1/cts/actions/runs/3695246683/jobs/6257391400
I set the job timeout to 5 hours (the max for GitHub is 6 - then the job gets killed).
We have 2 CPUs and 7GB RAM, but it may be that we cannot 'fit' the CTS into this resource.
If not, we need additional infrastructure - one of:
- An enterprise GitHub account (can use larger GitHub-hosted runners)
- External runners (need to deploy on our own infrastructure, then install the GitHub runner client). Could be k8s, but needs resource/funding
- Skip GitHub Actions and use external resources directly - as above
Or we figure out how to speed up Egeria/CTS significantly.
I'll check the config further & try to debug via ssh just in case of any errors, and extend the timeout closer to 6h.
All worth looking at -- when I run locally these days it's against 20GB memory and 3 cores, and at a size of 2. I think it finishes within 3 hours or less (for XTDB).
So my first hunch would be that 7GB and 2 cores is probably too small (7GB maybe the main culprit -- could it just be hitting a non-stop swapping scenario?)
I usually run on a 3-6 x 16GB cluster... though often with multiple instances in parallel (all the charts).
I have run locally in around 6-8GB, but indeed this config may sadly be too small.
I'm going to take a closer look if I can get an ssh session set up.
Two projects to set up GitHub Actions runners on a k8s cluster:
- https://github.com/evryfs/github-actions-runner-operator
- https://github.com/actions-runner-controller/actions-runner-controller
The latter is being taken over by GitHub for native support: https://github.com/actions/actions-runner-controller/discussions/2072
Investigated external runners, but hit issues with KinD; commented in the actions-runner-controller discussion.
Reverted to debugging on GitHub-hosted runners. The following fragment assisted with debugging (see https://github.com/lhotari/action-upterm for more info):
=== debug
```yaml
- name: Setup upterm session
  uses: lhotari/action-upterm@v1
  with:
    ## limits ssh access and adds the ssh public key for the user which triggered the workflow
    limit-access-to-actor: true
```
The issue turned out to be that the strimzi operator pod was not starting due to failing to meet cpu constraints. This defaulted to '500m' (0.500 cpu units), which should have been OK. However, even 1m failed to schedule. This looks like a KinD issue, but overriding the min cpu to '0m' allowed the pods to schedule. This was needed for our own pods too.
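For illustration, the override amounts to something like the following. The exact values path depends on the chart, so the keys here are placeholders rather than the real chart values:

```shell
# Hypothetical values path: the point is to drop the minimum cpu request to 0
# so that the operator (and our own pods) can be scheduled under KinD.
helm install cts ./egeria-cts --set strimzi.resources.requests.cpu=0m
```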
Added additional checks. For example:
```shell
until kubectl get pod -l app.kubernetes.io/name=kafka -o go-template='{{.items | len}}' | grep -qxF 1; do
  echo "Waiting for pod"
  sleep 1
done
```
This fragment simply loops until a pod matching the selector exists. (kubectl rollout status may also be useful.)
Then we can do:
```shell
kubectl wait pods --selector=app.kubernetes.io/name=kafka --for condition=Ready --timeout=10m
```
This will return immediately if no pod matching the selector exists, which is why the above check is needed first.
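For deployments, `kubectl rollout status` (mentioned above) behaves similarly, blocking until the rollout completes. The deployment name here is just an example, not necessarily what the chart creates:

```shell
# Blocks (up to 10 minutes) until the named deployment has finished rolling out.
kubectl rollout status deployment/strimzi-cluster-operator --timeout=10m
```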
None of these checks help run the CTS as such; rather they help report the current stage in the GitHub Actions job log.
If CTS works we can revisit better approaches, custom actions, etc.
SUCCESSFUL test run -> https://github.com/planetf1/cts/actions/runs/3708502295 - i.e. tasks completed as successful.
Results are attached to the job.
Will elaborate the job to do some basic checks of the results.
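For anyone wanting to pull the attached results locally, the GitHub CLI can download a run's artifacts, e.g. for the run linked above:

```shell
# Downloads all artifacts attached to the given workflow run into the current directory
gh run download 3708502295 --repo planetf1/cts
```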
Example output I'm experimenting with.
This is based on positive/negative evidence counts in the detailed CTS results, i.e.:
```
➜ graph ./cts-analyze.py
Metadata sharing              MANDATORY_PROFILE  CONFORMANT_FULL_SUPPORT  [ 71657 / 0 ]
Reference copies              OPTIONAL_PROFILE   NOT_CONFORMANT           [ 8496 / 32 ]
Metadata maintenance          OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 14126 / 0 ]
Dynamic types                 OPTIONAL_PROFILE   UNKNOWN_STATUS           [ 0 / 0 ]
Graph queries                 OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 528 / 0 ]
Historical search             OPTIONAL_PROFILE   CONFORMANT_NO_SUPPORT    [ 530 / 0 ]
Entity proxies                OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 2759 / 0 ]
Soft-delete and restore       OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 2592 / 0 ]
Undo an update                OPTIONAL_PROFILE   CONFORMANT_NO_SUPPORT    [ 406 / 0 ]
Reidentify instance           OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 2650 / 0 ]
Retype instance               OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 16365 / 0 ]
Rehome instance               OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 1590 / 0 ]
Entity search                 OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 62878 / 0 ]
Relationship search           OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 8253 / 0 ]
Entity advanced search        OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 44800 / 0 ]
Relationship advanced search  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 9312 / 0 ]
FAIL [246942/32]
➜ graph echo $?
1
```
This returns a simple pass/fail, based on whether any assertions have failed. It does not (yet?) compare to a baseline.
There are many other interpretations we could apply to the data - formatting the evidence, checking for other exceptions in the log, etc. Having experimented, the code could also be refactored to be a lot neater.
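In a workflow step, the analyzer's exit code can then gate the job directly, e.g. (using the same directory layout as the example above):

```shell
# Run the analyzer against the detailed results for the graph repository;
# a non-zero exit code (any failed assertions) fails the build step.
(cd graph && ./cts-analyze.py) || { echo "CTS assertions failed for graph"; exit 1; }
```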
Added checks into the latest pipeline. Set the default container to 'latest'. Added a daily schedule.
I have reverted the doubling of the retry count used during the CTS after seeing run-times on the CTS automation exceed 6 hours. Analysis of the CTS execution is needed, but perhaps we were hitting many more of these time limits than I'd expected, even during successful execution.
See https://github.com/odpi/egeria/pull/7314 - need to test it through CI/CD to get an exact comparison.
I'm proposing to move my repo under odpi. Whilst no doubt we can make improvements and refactor, it's a starting point, and moving it will make it easier for others to use it, review test results, improve the CTS, and improve our test infrastructure.
Having backed off the timer increase, the CTS is now running in 4-4.5 hours. Will leave it like this.
The development work for this is complete.