Add Conformance Test Suite to automated build
I suggest we add running the Conformance Test Suite (CTS) against the in-memory repository as part of our once-daily build.
At a later point in time we can consider if this could be run against other metadata repositories.
I presume this is still a requirement? @mandy-chessell @cmgrote @grahamwallis
Yes please
> At a later point in time we can consider if this could be run against other metadata repositories.
There's an initial version of this already part of our Helm charts for other metadata repositories. Just set the following in the values.yaml of the vdc chart (by default it's set to false), and it should create and configure a CTS instance for each repository being deployed as part of the chart:
```yaml
# Egeria Conformance Test Suite - sets up to run against all Egeria repositories (if enabled=true)
cts:
  enabled: true
```
(I think this is probably our best option, since it will require such an external repository to first exist -- probably not something that will ever be part of our automated Jenkins builds, particularly for proprietary / licensed repositories?)
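For reference, this could also be enabled at install time rather than by editing values.yaml. A minimal sketch, where the release name and chart location are placeholders rather than the real chart coordinates:

```shell
# Equivalent to setting cts.enabled=true in the values.yaml of the vdc chart.
# 'vdc' as the release name and './vdc' as the chart path are assumptions.
helm install vdc ./vdc --set cts.enabled=true
```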
In addition to running the CTS, the results should be shared in some way - for example through some kind of build artifact - so that a consuming organisation could refer back to the CTS results for a shipped release, and so that developers can see CTS results from each build run.
These could then be linked to from the release notes / top-level README.
For 1.2 I plan to execute the CTS & post the results at, or with a link from, the GitHub releases page. Experience learnt will be used to help refine the requirements for 1.3, where automation will be targeted.
CTS is now looking good in 1.2, but for this release the run is done semi-manually (i.e. via a notebook). Automation will follow in a subsequent release.
I am starting to prototype some CI/CD definitions for automated Helm chart deployment. Initially this targets Azure with a very limited subscription as a POC, and with a basic notebook deployment (fewer moving parts), but I will a) move to a fuller subscription and b) add the CTS once some initial proof points are complete.
I'll link the necessary PRs for the CI/CD definitions here. Some of the changes are in base egeria, others are done directly through Azure Pipelines.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
The charts worked well for release testing on graph & inmem (after updates for inmem & OpenShift).
The next step would be to look at running this automatically within a CI/CD workflow, probably triggered off a schedule.
We could take something like the KinD workflow from https://github.com/odpi/egeria-database-connectors/blob/main/.github/workflows/connector-test.yml - obviously there's more to do.
Also referenced recently during the 3.7 release: https://github.com/odpi/egeria/issues/6341
A few simple things we could check (a rough sketch follows below):
- number of tests failed is 0
- number of tests successful is > ???? (current number? just big? could be checked in)
- no exceptions in the audit log
- profile results compare to a baseline (which could be checked in)
Better would be to properly map the tests to test suites, i.e. much more fine-grained, but this is likely substantial development.
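As a very rough illustration of the first check, something like this could gate a build step. The results file name and field name below are purely hypothetical - the real CTS report structure would need to be checked:

```shell
# Hypothetical sketch: sum any failed-assertion counts found in the results file
# and fail the step if the total is non-zero. File name and field name are placeholders.
FAILED=$(jq '[.. | .unsuccessfulAssertions? // empty] | add // 0' cts-results.json)
if [ "$FAILED" -ne 0 ]; then
  echo "CTS reported $FAILED failed assertions"
  exit 1
fi
```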
CTS failures can take a while to investigate, so automation could pick up issues a lot quicker - for example by running daily.
One concern is whether the 7GB agent running KinD would have enough resource to complete the tests
An initial version of this is now being tested at https://github.com/planetf1/cts. Still debugging, but as expected the GitHub Action will (roughly sketched below):
- run 2 parallel jobs - for inmem & graph
- install k8s within a container (KinD)
- set up & install our CTS chart
- wait for results
- capture and post the results as an attachment
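Roughly, each job boils down to steps along these lines. The chart location, values, selector and pod name here are illustrative only, not the actual definitions in the prototype repo:

```shell
# Create a local Kubernetes cluster inside the GitHub-hosted runner
kind create cluster --name cts

# Install the CTS chart for one repository flavour (chart path and values are placeholders)
helm install cts ./egeria-cts --set connector=inmem

# Wait for the platform pod to be ready before polling for results
kubectl wait pods --selector=app.kubernetes.io/name=egeria-cts \
  --for condition=Ready --timeout=15m

# Copy the results out of the pod so the workflow can attach them as an artifact
kubectl cp cts-platform-0:/deployments/data ./cts-results
```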
Caveats:
- Manual trigger only (for testing)
- The 'name' of the job is based on the connector, which is a long name - needs parsing to something simpler
- Personal repo - just to get started quickly (need to discuss with the team where this belongs)
- Need to consider scheduling - daily? Triggers? Dependencies?
- How to report / ensure results are looked at - Slack?
- Need some simple analysis of the results for pass/fail (i.e. # tests, exceptions, etc.) (maybe split out from the test)
- Hardcoded to 3.14-SNAPSHOT (may benefit from a tag for the latest release)
cc: @cmgrote @mandy-chessell @lpalashevski
After 4.5 hours the CTS is still running, even at the minimum size (5 vs 1) and even with the in-memory repository (graph takes longer) -> https://github.com/planetf1/cts/actions/runs/3695246683/jobs/6257391400
I set the job timeout to 5 hours (the max for GitHub is 6 - then the job gets killed).
We have 2 CPUs and 7GB RAM, but it may be that we cannot 'fit' the CTS into this resource.
If not, we need additional infrastructure - one of:
- An enterprise GitHub account (can use larger GitHub-hosted runners)
- External runners (need to deploy on our own infrastructure, then install the GitHub runner client). Could be k8s, but needs resource/funding
- Skip GitHub Actions and use external resources directly - as above
Or we figure out how to speed up Egeria/CTS significantly.
I'll check the config further & try to debug via ssh just in case of any errors, and extend the timeout closer to 6h.
All worth looking at -- when I run locally these days it's against 20GB memory and 3 cores, and at a size of 2. I think it finishes within 3 hours or less (for XTDB).
So my first hunch would be that 7GB and 2 cores is probably too small (7GB maybe the main culprit -- could it just be hitting a non-stop swapping scenario?)
I usually run on a 3-6 x 16GB cluster... though often with multiple instances in parallel (all the charts).
I have run locally in around 6-8GB, but indeed this config may sadly be too small.
I'm going to take a closer look if I can get an ssh session set up.
Two projects to set up GitHub Actions runners on a k8s cluster:
- https://github.com/evryfs/github-actions-runner-operator
- https://github.com/actions-runner-controller/actions-runner-controller
The latter is being taken over by GitHub for native support: https://github.com/actions/actions-runner-controller/discussions/2072
Investigated external runners, but hit issues with KinD; commented in the actions-runner-controller discussion.
Reverted to debugging on GitHub-hosted runners. The following fragment assisted with debugging (see https://github.com/lhotari/action-upterm for more info):
=== debug
```yaml
- name: Setup upterm session
  uses: lhotari/action-upterm@v1
  with:
    ## limits ssh access and adds the ssh public key for the user which triggered the workflow
    limit-access-to-actor: true
```
The issue turned out to be that the strimzi operator pod was not starting due to failing to meet cpu constraints. This defaulted to '500m' (0.500 cpu units), which should have been OK. However, even 1m failed to schedule. This looks like a KinD issue, but overriding the min cpu to '0m' allowed the pods to schedule. This was needed for our own pods too.
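For illustration, the override amounts to something like the following. The exact values path depends on the chart, so the keys here are placeholders rather than the real chart values:

```shell
# Hypothetical values path: the point is to drop the minimum cpu request to 0
# so that the operator (and our own pods) can be scheduled under KinD.
helm install cts ./egeria-cts --set strimzi.resources.requests.cpu=0m
```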
Added additional checks. For example:
```shell
until kubectl get pod -l app.kubernetes.io/name=kafka -o go-template='{{.items | len}}' | grep -qxF 1; do
  echo "Waiting for pod"
  sleep 1
done
```
This fragment simply loops until a pod matching the selector exists. (kubectl rollout status may also be useful.)
Then we can do:
```shell
kubectl wait pods --selector=app.kubernetes.io/name=kafka --for condition=Ready --timeout=10m
```
This will return immediately if no pod matching the selector exists, which is why the above check is needed first.
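For deployments, `kubectl rollout status` (mentioned above) behaves similarly, blocking until the rollout completes. The deployment name here is just an example, not necessarily what the chart creates:

```shell
# Blocks (up to 10 minutes) until the named deployment has finished rolling out.
kubectl rollout status deployment/strimzi-cluster-operator --timeout=10m
```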
None of these checks help run the CTS as such; rather they help report the current stage in the GitHub Actions job log.
If CTS works we can revisit better approaches, custom actions, etc.
SUCCESSFUL test run -> https://github.com/planetf1/cts/actions/runs/3708502295 - i.e. tasks completed as successful.
Results are attached to the job.
Will elaborate the job to do some basic checks of the results.
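For anyone wanting to pull the attached results locally, the GitHub CLI can download a run's artifacts, e.g. for the run linked above:

```shell
# Downloads all artifacts attached to the given workflow run into the current directory
gh run download 3708502295 --repo planetf1/cts
```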
Example output I'm experimenting with.
This is based on positive/negative evidence counts in the detailed CTS results, i.e.:
```
➜ graph ./cts-analyze.py
Metadata sharing              MANDATORY_PROFILE  CONFORMANT_FULL_SUPPORT  [ 71657 / 0 ]
Reference copies              OPTIONAL_PROFILE   NOT_CONFORMANT           [ 8496 / 32 ]
Metadata maintenance          OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 14126 / 0 ]
Dynamic types                 OPTIONAL_PROFILE   UNKNOWN_STATUS           [ 0 / 0 ]
Graph queries                 OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 528 / 0 ]
Historical search             OPTIONAL_PROFILE   CONFORMANT_NO_SUPPORT    [ 530 / 0 ]
Entity proxies                OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 2759 / 0 ]
Soft-delete and restore       OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 2592 / 0 ]
Undo an update                OPTIONAL_PROFILE   CONFORMANT_NO_SUPPORT    [ 406 / 0 ]
Reidentify instance           OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 2650 / 0 ]
Retype instance               OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 16365 / 0 ]
Rehome instance               OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 1590 / 0 ]
Entity search                 OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 62878 / 0 ]
Relationship search           OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 8253 / 0 ]
Entity advanced search        OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 44800 / 0 ]
Relationship advanced search  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT  [ 9312 / 0 ]
FAIL [246942/32]
➜ graph echo $?
1
```
This returns a simple pass/fail, based on whether any assertions have failed. It does not (yet?) compare to a baseline.
There are many other interpretations we could apply to the data - formatting the evidence, checking for other exceptions in the log, etc. Having experimented, the code could also be refactored to be a lot neater.
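In a workflow step, the analyzer's exit code can then gate the job directly, e.g. (using the same directory layout as the example above):

```shell
# Run the analyzer against the detailed results for the graph repository;
# a non-zero exit code (any failed assertions) fails the build step.
(cd graph && ./cts-analyze.py) || { echo "CTS assertions failed for graph"; exit 1; }
```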
Added checks into the latest pipeline. Set the default container to 'latest'. Added a daily schedule.
I have reverted the doubling of the retry count used during the CTS after seeing run-times on the CTS automation exceed 6 hours. Analysis of the CTS execution is needed, but perhaps we were hitting many more of these time limits than I'd expected, even during successful execution.
See https://github.com/odpi/egeria/pull/7314 - need to test it through CI/CD to get an exact comparison.
I'm proposing to move my repo under odpi. Whilst no doubt we can make improvements and refactor, it's a starting point, and moving it will make it easier for others to use it, review test results, improve the CTS, and improve our test infrastructure.
Having backed off the timer increase, the CTS is now running in 4-4.5 hours. Will leave it like this.
The development work for this is complete.