aws: kola iam role can get into weird state if user doesn't have passrole perms
If I initially run kola without the iam:PassRole permission, I end up with an instance profile that doesn't have a role associated with it.
In this case I ran kola on a fresh account (no CreatedBy=mantle resources) and it failed because I didn't have PassRole permissions:
[coreos-assembler]$ kola -p aws --aws-ami ami-0e884738127697eb9 --aws-region us-east-1 -b fcos run coreos.ignition.resource.s3
=== RUN coreos.ignition.resource.s3
--- FAIL: coreos.ignition.resource.s3 (0.32s)
harness.go:507: Cluster failed starting machines: error verifying IAM instance profile: adding role "kola" to instance profile "kola": AccessDenied: User: arn:aws:iam::013116697141:user/dusty-fcos is not authorized to perform: iam:PassRole on resource: role kola
status code: 403, request id: 08080270-c449-11e9-a7a4-a589214ff8db
FAIL, output in _kola_temp/aws-2019-08-21-1922-321
harness: test suite failed
Subsequent runs of kola won't re-attempt to fix the error (i.e. a kola role exists so continue on):
[coreos-assembler]$
[coreos-assembler]$ kola -p aws --aws-ami ami-0e884738127697eb9 --aws-region us-east-1 -b fcos run coreos.ignition.resource.s3
=== RUN coreos.ignition.resource.s3
--- FAIL: coreos.ignition.resource.s3 (351.12s)
harness.go:507: Cluster failed starting machines: machine "i-0d3611d49fc18f878" failed to start: ssh journalctl failed: dial tcp 52.201.248.149:22: connect: connection refused
) on machine i-0d3611d49fc18f878 consolening (fs/kernfs/dir.c:1503 kernfs_remove_by_name_ns+0x83/0x90
FAIL, output in _kola_temp/aws-2019-08-21-1925-336
harness: test suite failed
This test eventually fails because there is no role in the instance profile:
$ curl http://169.254.169.254/latest/meta-data/iam/info
{
"Code" : "Success",
"Message" : "Instance Profile does not contain a role. Please see documentation at http://docs.amazonwebservices.com/IAM/latest/UserGuide/RolesTroubleshooting.html.",
"LastUpdated" : "2019-08-21T18:43:09Z",
"InstanceProfileArn" : "arn:aws:iam::00000000000:instance-profile/kola",
"InstanceProfileId" : "AIPARGFOZ5J262XIR3ZOJ"
}
I'm guessing we should either do a check for passrole early and fail before we even try to create the kola role, or we should check the instance profile later to make sure it contains a role before continuing. We could do both :)
At a minimum I think we need to make it clean up the resources if it hits an error during initial creation. It's probably also worth looking into whether we can check once per run that the instance profile contains the role; with how it's currently laid out I think we'd end up checking once per cluster, if not once per machine.
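To make the "clean up on error" idea concrete, here is a minimal sketch of creating the profile and rolling it back if attaching the role fails. It is not kola's actual code: the `iamClient` interface, `createProfileWithRole`, and the in-memory `fakeIAM` are all hypothetical names made up for illustration, and the interface is deliberately simpler than the real AWS SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// iamClient is a hypothetical, minimal stand-in for the IAM calls involved.
// It is not the AWS SDK interface and not what kola uses today.
type iamClient interface {
	CreateInstanceProfile(name string) error
	DeleteInstanceProfile(name string) error
	AddRoleToInstanceProfile(profile, role string) error
}

// createProfileWithRole creates the profile and attaches the role. If the
// attach step fails, it deletes the profile again so that no role-less
// profile is left behind for later runs to find.
func createProfileWithRole(c iamClient, profile, role string) error {
	if err := c.CreateInstanceProfile(profile); err != nil {
		return fmt.Errorf("creating instance profile %q: %w", profile, err)
	}
	if err := c.AddRoleToInstanceProfile(profile, role); err != nil {
		if delErr := c.DeleteInstanceProfile(profile); delErr != nil {
			return fmt.Errorf("adding role %q to instance profile %q: %v (cleanup also failed: %v)",
				role, profile, err, delErr)
		}
		return fmt.Errorf("adding role %q to instance profile %q: %w", role, profile, err)
	}
	return nil
}

// fakeIAM is an in-memory fake used only to exercise the sketch.
type fakeIAM struct {
	profiles     map[string][]string // profile name -> attached roles
	denyPassRole bool                // simulate a user without iam:PassRole
}

func (f *fakeIAM) CreateInstanceProfile(name string) error {
	if _, ok := f.profiles[name]; ok {
		return errors.New("instance profile already exists")
	}
	f.profiles[name] = nil
	return nil
}

func (f *fakeIAM) DeleteInstanceProfile(name string) error {
	delete(f.profiles, name)
	return nil
}

func (f *fakeIAM) AddRoleToInstanceProfile(profile, role string) error {
	if f.denyPassRole {
		return errors.New("AccessDenied: not authorized to perform iam:PassRole")
	}
	f.profiles[profile] = append(f.profiles[profile], role)
	return nil
}

func main() {
	c := &fakeIAM{profiles: map[string][]string{}, denyPassRole: true}
	err := createProfileWithRole(c, "kola", "kola")
	_, leftBehind := c.profiles["kola"]
	fmt.Println("error:", err)
	fmt.Println("profile left behind:", leftBehind)
}
```

With rollback in place, a failed first run leaves the account as it found it, so a later run (after PassRole has been granted) starts from a clean slate instead of trusting a half-built profile.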
After I added passrole to my user and deleted the existing kola role I got a successful test:
$ kola -p aws --aws-ami ami-0e884738127697eb9 --aws-region us-east-1 -b fcos run coreos.ignition.resource.s3
=== RUN coreos.ignition.resource.s3
--- PASS: coreos.ignition.resource.s3 (85.21s)
PASS, output in _kola_temp/aws-2019-08-21-1937-351
This is "kola prep isn't idempotent", right?
I think it tries to be but is missing a step. It checks for the role but I don't think it checks the instance profile to make sure it contains the role.
@cgwalters the code currently checks whether the instance profile exists. If it does, it returns immediately without checking the underlying roles; if it doesn't, it attempts to create the underlying roles. That led to this failure case: creating the underlying roles failed, but the instance profile already existed.
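One possible shape for the missing step, sketched under assumptions: instead of returning as soon as the profile exists, also verify that it contains the role and attach it if not, and guard the whole check with `sync.Once` so it runs once per process rather than once per cluster or machine. The `profileAPI` interface, `ensureProfileHasRole`, `onceChecker`, and `memIAM` are hypothetical names for illustration, not the AWS SDK or existing mantle code.

```go
package main

import (
	"fmt"
	"sync"
)

// profileAPI is a hypothetical stand-in for the IAM lookups and mutations.
type profileAPI interface {
	// RolesInProfile returns the attached roles and whether the profile exists.
	RolesInProfile(profile string) (roles []string, exists bool, err error)
	CreateInstanceProfile(profile string) error
	AddRoleToInstanceProfile(profile, role string) error
}

// ensureProfileHasRole creates the profile if it is missing and attaches the
// role if it is not already attached. It is safe to call repeatedly.
func ensureProfileHasRole(api profileAPI, profile, role string) error {
	roles, exists, err := api.RolesInProfile(profile)
	if err != nil {
		return fmt.Errorf("looking up instance profile %q: %w", profile, err)
	}
	if !exists {
		if err := api.CreateInstanceProfile(profile); err != nil {
			return fmt.Errorf("creating instance profile %q: %w", profile, err)
		}
	}
	for _, r := range roles {
		if r == role {
			return nil // profile exists and already contains the role
		}
	}
	if err := api.AddRoleToInstanceProfile(profile, role); err != nil {
		return fmt.Errorf("adding role %q to instance profile %q: %w", role, profile, err)
	}
	return nil
}

// onceChecker runs the check a single time and caches its result.
type onceChecker struct {
	once sync.Once
	err  error
}

func (o *onceChecker) Ensure(api profileAPI, profile, role string) error {
	o.once.Do(func() { o.err = ensureProfileHasRole(api, profile, role) })
	return o.err
}

// memIAM is an in-memory fake that counts lookups.
type memIAM struct {
	profiles map[string][]string // profile name -> attached roles
	lookups  int
}

func (m *memIAM) RolesInProfile(profile string) ([]string, bool, error) {
	m.lookups++
	roles, ok := m.profiles[profile]
	return roles, ok, nil
}

func (m *memIAM) CreateInstanceProfile(profile string) error {
	m.profiles[profile] = nil
	return nil
}

func (m *memIAM) AddRoleToInstanceProfile(profile, role string) error {
	m.profiles[profile] = append(m.profiles[profile], role)
	return nil
}

func main() {
	// Start from the broken state in this issue: profile exists, no role.
	api := &memIAM{profiles: map[string][]string{"kola": nil}}
	var check onceChecker
	for i := 0; i < 3; i++ { // e.g. three machines started in one run
		if err := check.Ensure(api, "kola", "kola"); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("roles in profile:", api.profiles["kola"])
	fmt.Println("lookups performed:", api.lookups)
}
```

Caching the error in `onceChecker` means a failed check is reported to every caller in that run without re-querying IAM; whether a failure should instead be retried is a design decision this sketch does not settle.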