coreos-assembler kola can hit InsufficientInstanceCapacity when running AWS tests

We've hit this in the pipeline when running AWS aarch64 tests:

17:28:20 2022-08-05T21:28:19Z kola: retryloop: failed to bring up machines: error running instances: InsufficientInstanceCapacity: We currently do not have sufficient c7g.xlarge capacity in the Availability Zone you requested (us-east-1e). Our system will be working on provisioning additional capacity. You can currently get c7g.xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f.

We should either:

stop tying our instances to a specific availability zone and let AWS select one for us
auto-detect this error and try an alternative availability zone
some other option possible?
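
For the auto-detect option, here is a minimal sketch of recognizing the error and pulling out the zones AWS suggests. It assumes the message keeps the wording shown in the log line above; `is_capacity_error` and `suggested_zones` are hypothetical helper names, not existing kola functions.

```python
from __future__ import annotations

import re

# Error code as it appears in the log line above.
CAPACITY_CODE = "InsufficientInstanceCapacity"
# Matches zone names such as us-east-1a (assumed naming pattern).
ZONE_RE = re.compile(r"\b[a-z]{2}(?:-[a-z]+)+-\d[a-z]\b")


def is_capacity_error(message: str) -> bool:
    """True if the message carries the capacity error code."""
    return CAPACITY_CODE in message


def suggested_zones(message: str) -> list[str]:
    """Return the zones listed after the word 'choosing', if any."""
    _, sep, tail = message.partition("choosing ")
    if not sep:
        return []
    return ZONE_RE.findall(tail)
```

Relying on message text is fragile, so the suggested zones are probably best treated as a hint rather than a guarantee.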

Aug 08 '22 14:08 jlebon

Some commits of interest:

https://github.com/coreos/coreos-assembler/commit/1817ead0a13527fe78f674755fc11de085ef466b
https://github.com/coreos/coreos-assembler/commit/a12a54949eb716da5e58d79d9c3a767a0760c71e

Aug 08 '22 14:08 jlebon

@jlebon @dustymabe: I was going over various options to tackle this.

Initially, I thought we could just add an additional filter (something similar to https://github.com/coreos/coreos-assembler/commit/1817ead0a13527fe78f674755fc11de085ef466b) to check whether an instance is available at that moment in the zone we select for testing. It turns out there's no such filter or functionality (AFAIK) that can check this before we actually run the test.
If we get an InsufficientInstanceCapacity error, we could retry the test in a different zone. A possible problem is that there's still no guarantee the other zone would have that instance type available at that time.
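
A rough sketch of the retry idea, written against a stand-in launcher rather than the real AWS call. `CapacityError`, `launch_with_fallback`, and `fake_launch` are hypothetical names for illustration only. Since no zone is guaranteed to have capacity, the loop gives up with an error once every candidate has been tried.

```python
from typing import Callable, Iterable


class CapacityError(Exception):
    """Stand-in for an InsufficientInstanceCapacity failure (hypothetical type)."""


def launch_with_fallback(launch: Callable[[str], str], zones: Iterable[str]) -> str:
    """Try each zone in order and return the first successful launch result.

    Only capacity errors trigger a retry; any other exception propagates.
    """
    tried = []
    for zone in zones:
        try:
            return launch(zone)
        except CapacityError:
            tried.append(zone)
    raise CapacityError("no capacity in any candidate zone: " + ", ".join(tried))


def fake_launch(zone: str) -> str:
    # Simulates us-east-1e being out of capacity, as in the log above.
    if zone == "us-east-1e":
        raise CapacityError(zone)
    return "instance-in-" + zone
```

With `fake_launch`, calling `launch_with_fallback(fake_launch, ["us-east-1e", "us-east-1a"])` skips the first zone and lands in the second.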

Are there any other ways or should I continue to implement the second idea?

Aug 10 '22 20:08 gursewak1997

@gursewak1997 I was wondering if we could specify the VPC instead of the subnet (subnets are specific to a particular availability zone) when making the request. Then, given the instance type and VPC, AWS would figure out an appropriate availability zone.

I'm sure either:

this doesn't work
we need to specify a subnet for another reason

but it would be good to at least investigate that route.
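
If it turns out a subnet does have to be specified, one way to approximate the VPC idea is to list the VPC's subnets and keep those whose zone offers the instance type. The sketch below works on plain data; the `Subnet` type and `candidate_subnets` function are hypothetical, and in practice the inputs would presumably come from EC2 calls such as DescribeSubnets and DescribeInstanceTypeOfferings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    subnet_id: str
    vpc_id: str
    zone: str  # each subnet lives in exactly one availability zone


def candidate_subnets(subnets, vpc_id, zones_offering_type):
    """Return IDs of subnets in the VPC whose zone offers the instance type.

    Input order is preserved, so callers can put preferred subnets first.
    """
    return [
        s.subnet_id
        for s in subnets
        if s.vpc_id == vpc_id and s.zone in zones_offering_type
    ]
```

Note that a zone offering an instance type does not mean it has capacity right now, so the result is a list of candidates to try, not a guarantee.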

Aug 11 '22 00:08 dustymabe

Bigger picture, I think we need to get away from running all the tests all the time. IOW, it may make sense to do a build, but then have tests run asynchronously/nonblocking on that build.

And once you go to that model, it makes a lot of sense to use e.g. spot instances for testing - and to do so with flexible requests for region/hardware.

Aug 19 '22 16:08 cgwalters

I believe the AWS tests are already non-blocking. But also, we've spent a lot of time arguing that quality suffers when tests are non-blocking and run after the fact, so wouldn't that be a substantial change in policy?

Aug 19 '22 16:08 bgilbert

I believe the AWS tests are already non-blocking.

Indeed. AWS tests (any cloud tests, actually) are forked off and don't block the main pipeline from finishing. We do look at the failures, though, and we require for our official releases that the tests for clouds pass before moving to the next step.

Aug 19 '22 20:08 dustymabe

I'm arguing something more nuanced; e.g. "tier 1" tests should be blocking on all platforms. "tier 2" tests might sometimes be run on some platforms, etc.

Aug 22 '22 21:08 cgwalters
