armada
E2E test suite for Armada
Test suite for Armada deployments. Run it in either of the following ways:

```shell
# Via go test:
go clean -testcache && ARMADA_EXECUTOR_INGRESS_PORT=5000 go test -v ./testsuite/...

# Via the testsuite CLI:
ARMADA_EXECUTOR_INGRESS_PORT=5000 go run cmd/testsuite/main.go test --tests "./testsuite/testcases/*.yaml"
```
Omitting go clean -testcache causes test results to be cached, i.e., tests won't be re-run unless the code has changed.
Here, ARMADA_EXECUTOR_INGRESS_PORT=5000 indicates the Kubernetes ingress is handled on port 5000.
This is necessary to support ingress controllers running on ports other than 80.
When running with go test, use the ARMADA_TEST_FILES environment variable to override the pattern for test files to be loaded. It defaults to testcases/*.yaml.
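For example, to run only a subset of test files, ARMADA_TEST_FILES can be combined with the other settings above (the glob pattern here is illustrative):

```shell
ARMADA_TEST_FILES="./testsuite/testcases/*.yaml" \
ARMADA_EXECUTOR_INGRESS_PORT=5000 \
go test -v ./testsuite/...
```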
Proposed initial set of tests:
Test | Steps | Automatable
---|---|---
Submit a single job that succeeds | Submit Hello World Job. Read Events. Assert Succeeded Event. | Yes (easy)
Submit a single job that fails to pull image | Submit Job with non-existent image. Read Events. Assert Job Failed Event. | Yes (easy)
Submit a single job that fails with OOM | Submit Job with xMB memory limit with command line that uses > xMB. Read Events. Assert Job Failed Event. | Yes (easy)
Submit a single job that succeeds close to memory limit | Submit Job with xMB memory limit with command line that uses 90% of xMB. Read Events. Assert Job Succeeded Event. | Yes (easy)
Submit a job that uses fileshares | Submit Job that reads and writes to file shares. Read Events. Assert Job Succeeded Event and files exist. | Yes (easy)
Submit a duplicate job | Submit Hello World Job with ClientId x. Submit Job again with same ClientId. Assert that the second response indicates a duplicate. Subscribe to Events. Assert Job Duplicate and Succeeded Messages. | Yes (easy)
Unschedulable job | Submit Job with invalid taint. Assert rejected. | Yes (easy)
Batch submit | Submit 1000 Hello World Jobs in a single batch. Read Events. Assert Job Succeeded Event for each. | Yes (easy)
Performance | Submit 1,000,000 jobs across multiple batches. Read Events. Assert all succeeded in a reasonable time frame. | Yes (hard, as limited resources in dev might make this too slow; could use a fake executor)
Performance | Submit a large number of jobs across many queues such that we generate x thousand messages per second. Read Events. Assert all succeeded in a reasonable time frame. | Yes (hard, as limited resources in the test env might make this too slow; could use a fake executor)
Subscribe to middle of stream | Submit Hello World Job. Read Events. Resubscribe to events with some offset. Assert Events as expected. | Yes (easy)
Cancel a job | Submit Hello World Job with a long sleep. Cancel job. Read Events. Assert Job Cancelled Event. | Yes (easy)
Cancel lots of jobs | Submit 1,000,000 jobs. Cancel jobs. Read Events. Assert Job Cancelled Event for each job. | Yes (easy if perf in dev is OK)
Reprioritize job | Submit Hello World Job with a medium sleep. Reprioritize job. Read Events. Assert Job Reprioritized Event. | Yes (easy, but only if asserting the event is good enough)
Reprioritize lots of jobs | Submit 1,000,000 jobs. Reprioritize jobs. Read Events. Assert Job Reprioritized Event for each job. | Yes (easy if perf in dev is OK)
Submit job using Python client | Submit Hello World Job using the Python client. Read Events. Assert Succeeded Event. | Yes (easy)
Submit job using dotnet client | Submit Hello World Job using the dotnet client. Read Events. Assert Succeeded Event. | Yes (easy)
Lookout | Run the above tests. Assert jobs have shown up in Lookout with the correct state. | No; suggestion is to manually check Lookout after the above tests and assert that results are as expected
Binoculars | Submit Hello World Job with some logging. Go to Lookout. Click on logs and assert you can see the job. | No (needs to be a manual test)
Ingress | Submit Hello World Job with an ingress that stops the job when an HTTP call is made to the ingress. Read events and pull out ingress info. Make call to ingress. Assert Succeeded Event. | Yes (easy/medium)
HA | Submit a job every half second for 5 minutes. Kill one of the Armada components (API, Lookout ingester, Pulsar, Redis). Subscribe to events. Assert all jobs succeeded. | Yes (medium/hard)
Postgres failover | Submit jobs. Kill the Postgres master. See that Lookout keeps working. | No
Cluster failover and job timeout | Have two clusters active. Submit Jobs. Kill one cluster. See that lost jobs are rescheduled. See that new jobs are submitted to the good cluster. | Yes (hard)
Queue permissions | Submit to a queue that the user is not permissioned to. Assert that the submit is rejected. | Yes (easy)
Namespace permissions | Submit to a queue that the user is permissioned to, but under a namespace that they are not permissioned to. Subscribe to events. See Job Failed Event. | Yes (easy?)
Long-running subscription | Submit Hello World Jobs over a long time period (hours). Subscribe to events over this period. Check that all events are as expected. Repeat for the C# and Python clients. | Yes, but hard because dev/staging clusters might not be stable enough?
Queue ops | Create a queue. Check the queue can be listed. Submit to the queue. Check that the job succeeds. Delete the queue. Check the queue can no longer be listed. Submit a job to the queue. Check that the submit is rejected. | Yes (easy?)
Podspec too large | Submit a job with a podspec > 64KB. See that the job is rejected. | Yes (easy)
Too much ephemeral storage | Submit a job that requests storage but writes more data to disk than it asks for. Check that the job fails. | Yes (easy)