E2e testsuite for Armada

Open dejanzele opened this issue 2 years ago • 1 comments

Test suite for Armada deployments. Run in either of the following ways:
go clean -testcache && ARMADA_EXECUTOR_INGRESS_PORT=5000 go test -v ./testsuite/...
ARMADA_EXECUTOR_INGRESS_PORT=5000 go run cmd/testsuite/main.go test --tests "./testsuite/testcases/*.yaml"

If go clean -testcache is omitted, Go may reuse cached test results, i.e., tests won't be re-run unless the code has changed.

Here, ARMADA_EXECUTOR_INGRESS_PORT=5000 indicates the Kubernetes ingress is handled on port 5000.
This is necessary to support ingress controllers running on ports other than 80.

When running with go test, use the ARMADA_TEST_FILES environment variable to override the pattern for test files to be loaded. It defaults to testcases/*.yaml.

Proposed initial set of tests:

| Test | Steps | Automatable |
| --- | --- | --- |
| Submit a single Job that succeeds | Submit Hello World Job<br>Read Events<br>Assert Succeeded Event | Yes (easy) |
| Submit a single Job that fails to pull image | Submit Job With Non-Existent Image<br>Read Events<br>Assert Job Failed Event | Yes (easy) |
| Submit a single Job that fails with OOM | Submit Job with xMB memory limit with command line that uses > xMB<br>Read Events<br>Assert Job Failed | Yes (easy) |
| Submit a single job that succeeds close to memory limit | Submit Job with xMB memory limit with command line that uses 90% of xMB<br>Read Events<br>Assert Job Succeeded Event | Yes (easy) |
| Submit a job that uses fileshares | Submit Job that reads and writes to file shares<br>Read Events<br>Assert Job succeeded and files exist | Yes (easy) |
| Submit a duplicate job | Submit Hello World Job with ClientId x<br>Submit Job again with same client ID<br>Assert that second response indicates duplicate<br>Subscribe to Events<br>Assert Job Duplicate and Succeeded Messages | Yes (easy) |
| Unschedulable Job | Submit Job with invalid taint<br>Assert rejected | Yes (easy) |
| Batch Submit | Submit 1000 Hello World jobs in a single batch<br>Read Events<br>Assert Job Succeeded | Yes (easy) |
| Performance | Submit 1,000,000 jobs across multiple batches<br>Read Events<br>Assert all succeeded in reasonable time frame | Yes (hard, as limited resources in dev might make this too slow; could use fake executor) |
| Performance | Submit large number of jobs across many queues such that we generate x thousand messages per second<br>Read Events<br>Assert all succeeded in reasonable time frame | Yes (hard, as limited resources in test env might make this too slow; could use fake executor) |
| Subscribe to middle of stream | Submit Hello World Job<br>Read Events<br>Resubscribe to events with some offset<br>Assert Events as expected | Yes (easy) |
| Cancel a job | Submit Hello World Job with a long sleep<br>Cancel job<br>Read Events<br>Assert Job Cancelled Event | Yes (easy) |
| Cancel lots of jobs | Submit 1,000,000 jobs<br>Cancel jobs<br>Read Events<br>Assert Job Cancelled Event for each job | Yes (easy if perf in dev ok) |
| Reprioritize Job | Submit Hello World Job with a medium sleep<br>Reprioritize job<br>Read Events<br>Assert Job Reprioritized Event | Yes (easy, but only if asserting the event is good enough) |
| Reprioritize lots of jobs | Submit 1,000,000 jobs<br>Reprioritize jobs<br>Read Events<br>Assert Job Reprioritized Event for each job | Yes (easy if perf in dev ok) |
| Submit Job using python client | Submit Hello World Job using python client<br>Read Events<br>Assert Succeeded Event | Yes (easy) |
| Submit job using dotnet client | Submit Hello World Job using dotnet client<br>Read Events<br>Assert Succeeded Event | Yes (easy) |
| Lookout | Run above tests<br>Assert jobs have shown up in lookout with correct state | No. Suggestion is to manually check lookout after the above tests and assert that results are as expected. |
| Binoculars | Submit Hello World Job with some logging<br>Go to lookout<br>Click on logs and assert you can see the job | No (needs to be manual test) |
| Ingress | Submit Hello World Job with ingress that stops the job when an HTTP call is made to the ingress<br>Read events and pull out ingress info<br>Make call to ingress<br>Assert Succeeded Event | Yes (easy/medium) |
| HA | Submit job every half second for 5 mins<br>Kill one of the armada components (api, lookout ingester, pulsar, redis)<br>Subscribe to events<br>Assert all jobs succeeded | Yes (medium/hard) |
| Postgres failover | Submit jobs<br>Kill postgres master<br>See that lookout keeps working | No |
| Cluster Failover and job timeout | Have two clusters active<br>Submit Jobs<br>Kill one cluster<br>See that lost jobs are rescheduled<br>See that new jobs are submitted to good cluster | Yes (hard) |
| Queue Permissions | Submit to queue that user is not permissioned to<br>Assert that submit is rejected | Yes (easy) |
| Namespace permissions | Submit to queue that user is permissioned to but under namespace that they are not permissioned to<br>Subscribe to events<br>See job failed event | Yes (easy?) |
| Long running Subscription | Submit Hello World jobs over long time period (hours)<br>Subscribe to events over this period<br>Check that all events are as expected<br>Repeat for C# and python clients | Yes, but hard because dev/staging clusters might not be stable enough? |
| Queue Ops | Create a queue<br>Check the queue can be listed<br>Submit to the queue<br>Check that the job succeeds<br>Delete the queue<br>Check the queue can no longer be listed<br>Submit job to queue<br>Check that submit is rejected | Yes (easy?) |
| Podspec too Large | Submit a job with a podspec > 64KB<br>See that the job is rejected | Yes (easy) |
| Too Much ephemeral storage | Submit a job that requests storage but writes more data to disk than it asks for<br>Check that the job fails | Yes (easy) |
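
Most of the automatable rows share one shape: submit, read events, assert that certain events arrived. A minimal sketch of that assertion step is below. The EventType values and the assertEventsInOrder helper are hypothetical stand-ins for illustration, not Armada's actual event API; the sketch assumes unrelated events may be interleaved between the expected ones.

```go
package main

import "fmt"

// EventType is a hypothetical stand-in for the job event kinds named in the table.
type EventType string

const (
	Submitted EventType = "submitted"
	Queued    EventType = "queued"
	Running   EventType = "running"
	Succeeded EventType = "succeeded"
	Failed    EventType = "failed"
	Cancelled EventType = "cancelled"
)

// assertEventsInOrder returns an error unless every expected event occurs in
// the received stream in the given order. Other events may appear in between.
func assertEventsInOrder(received, expected []EventType) error {
	next := 0 // index of the next expected event still to be matched
	for _, e := range received {
		if next < len(expected) && e == expected[next] {
			next++
		}
	}
	if next != len(expected) {
		return fmt.Errorf("missing event %q (matched %d of %d expected events)",
			expected[next], next, len(expected))
	}
	return nil
}

func main() {
	// In a real test these would be read from the job's event stream.
	received := []EventType{Submitted, Queued, Running, Succeeded}

	if err := assertEventsInOrder(received, []EventType{Submitted, Succeeded}); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```

Matching an ordered subsequence rather than the exact stream keeps such tests from breaking when intermediate events are added, at the cost of not catching unexpected extra events.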

dejanzele avatar Jun 30 '22 15:06 dejanzele
