armada
E2E test suite for Armada
Test suite for Armada deployments. Run it in either of the following ways:

```shell
# Via go test:
go clean -testcache && ARMADA_EXECUTOR_INGRESS_PORT=5000 go test -v ./testsuite/...

# Via the testsuite CLI:
ARMADA_EXECUTOR_INGRESS_PORT=5000 go run cmd/testsuite/main.go test --tests "./testsuite/testcases/*.yaml"
```
Omitting go clean -testcache causes test results to be cached, i.e., tests won't be re-run unless the code has changed.
Here, ARMADA_EXECUTOR_INGRESS_PORT=5000 indicates the Kubernetes ingress is handled on port 5000.
This is necessary to support ingress controllers running on ports other than 80.
When running with go test, use the ARMADA_TEST_FILES environment variable to override the pattern for test files to be loaded. It defaults to testcases/*.yaml.
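For example, to run only a subset of test files, ARMADA_TEST_FILES can be combined with the other settings above (the glob pattern here is illustrative):

```shell
ARMADA_TEST_FILES="./testsuite/testcases/*.yaml" \
ARMADA_EXECUTOR_INGRESS_PORT=5000 \
go test -v ./testsuite/...
```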
Proposed initial set of tests:
Test | Steps | Automatable
---|---|---
Submit a single job that succeeds | Submit Hello World Job. Read Events. Assert Succeeded Event. | Yes (easy)
Submit a single job that fails to pull image | Submit Job with non-existent image. Read Events. Assert Job Failed Event. | Yes (easy)
Submit a single job that fails with OOM | Submit Job with xMB memory limit with command line that uses > xMB. Read Events. Assert Job Failed Event. | Yes (easy)
Submit a single job that succeeds close to memory limit | Submit Job with xMB memory limit with command line that uses 90% of xMB. Read Events. Assert Job Succeeded Event. | Yes (easy)
Submit a job that uses fileshares | Submit Job that reads and writes to file shares. Read Events. Assert Job Succeeded Event and files exist. | Yes (easy)
Submit a duplicate job | Submit Hello World Job with ClientId x. Submit Job again with same ClientId. Assert that the second response indicates a duplicate. Subscribe to Events. Assert Job Duplicate and Succeeded Messages. | Yes (easy)
Unschedulable job | Submit Job with invalid taint. Assert rejected. | Yes (easy)
Batch submit | Submit 1000 Hello World Jobs in a single batch. Read Events. Assert Job Succeeded Event for each. | Yes (easy)
Performance | Submit 1,000,000 jobs across multiple batches. Read Events. Assert all succeeded in a reasonable time frame. | Yes (hard, as limited resources in dev might make this too slow; could use a fake executor)
Performance | Submit a large number of jobs across many queues such that we generate x thousand messages per second. Read Events. Assert all succeeded in a reasonable time frame. | Yes (hard, as limited resources in the test env might make this too slow; could use a fake executor)
Subscribe to middle of stream | Submit Hello World Job. Read Events. Resubscribe to events with some offset. Assert Events as expected. | Yes (easy)
Cancel a job | Submit Hello World Job with a long sleep. Cancel job. Read Events. Assert Job Cancelled Event. | Yes (easy)
Cancel lots of jobs | Submit 1,000,000 jobs. Cancel jobs. Read Events. Assert Job Cancelled Event for each job. | Yes (easy if perf in dev is OK)
Reprioritize job | Submit Hello World Job with a medium sleep. Reprioritize job. Read Events. Assert Job Reprioritized Event. | Yes (easy, but only if asserting the event is good enough)
Reprioritize lots of jobs | Submit 1,000,000 jobs. Reprioritize jobs. Read Events. Assert Job Reprioritized Event for each job. | Yes (easy if perf in dev is OK)
Submit job using Python client | Submit Hello World Job using the Python client. Read Events. Assert Succeeded Event. | Yes (easy)
Submit job using dotnet client | Submit Hello World Job using the dotnet client. Read Events. Assert Succeeded Event. | Yes (easy)
Lookout | Run the above tests. Assert jobs have shown up in Lookout with the correct state. | No; suggestion is to manually check Lookout after the above tests and assert that results are as expected
Binoculars | Submit Hello World Job with some logging. Go to Lookout. Click on logs and assert you can see the job. | No (needs to be a manual test)
Ingress | Submit Hello World Job with an ingress that stops the job when an HTTP call is made to the ingress. Read events and pull out ingress info. Make call to ingress. Assert Succeeded Event. | Yes (easy/medium)
HA | Submit a job every half second for 5 minutes. Kill one of the Armada components (API, Lookout ingester, Pulsar, Redis). Subscribe to events. Assert all jobs succeeded. | Yes (medium/hard)
Postgres failover | Submit jobs. Kill the Postgres master. See that Lookout keeps working. | No
Cluster failover and job timeout | Have two clusters active. Submit Jobs. Kill one cluster. See that lost jobs are rescheduled. See that new jobs are submitted to the good cluster. | Yes (hard)
Queue permissions | Submit to a queue that the user is not permissioned to. Assert that the submit is rejected. | Yes (easy)
Namespace permissions | Submit to a queue that the user is permissioned to, but under a namespace that they are not permissioned to. Subscribe to events. See Job Failed Event. | Yes (easy?)
Long-running subscription | Submit Hello World Jobs over a long time period (hours). Subscribe to events over this period. Check that all events are as expected. Repeat for the C# and Python clients. | Yes, but hard because dev/staging clusters might not be stable enough?
Queue ops | Create a queue. Check the queue can be listed. Submit to the queue. Check that the job succeeds. Delete the queue. Check the queue can no longer be listed. Submit a job to the queue. Check that the submit is rejected. | Yes (easy?)
Podspec too large | Submit a job with a podspec > 64KB. See that the job is rejected. | Yes (easy)
Too much ephemeral storage | Submit a job that requests storage but writes more data to disk than it asks for. Check that the job fails. | Yes (easy)