LBInit issues

Open rvencu opened this issue 3 years ago • 2 comments

A while ago I ran into LBInit issues: when I delete a stack, LBInit usually fails to delete. The workaround is to wait a few more minutes and then retry the stack delete, which then works.
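For anyone who wants to script that workaround, here is a minimal sketch, not something taken from the 1click-hpc code. It assumes a boto3 CloudFormation client is passed in as `cfn`; the function name `delete_stack_with_retry` and its parameters are hypothetical, and the default pause of 300 seconds is only a guess at "a few more minutes".

```python
import time


def delete_stack_with_retry(cfn, stack_name, attempts=3, wait_seconds=300,
                            sleep=time.sleep):
    """Delete a stack, pausing and retrying if the deletion fails.

    `cfn` is assumed to be a boto3 CloudFormation client, for example
    boto3.client("cloudformation"). Returns the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        cfn.delete_stack(StackName=stack_name)
        try:
            # Blocks until the stack is gone; raises if deletion fails.
            cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
            return attempt
        except Exception as err:
            # With boto3 this is botocore.exceptions.WaiterError; it is
            # caught broadly here so the sketch has no botocore import.
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {wait_seconds}s")
            sleep(wait_seconds)
```

Usage would look like `delete_stack_with_retry(boto3.client("cloudformation"), "origtest")`. This only automates the manual retry; it does not fix whatever makes LBInit fail on the first delete.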

But today I started having problems with its creation as well. In the CloudWatch log I find this:

{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "PhysicalResourceId": "2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "StackId": "arn:aws:cloudformation:us-east-1:842865360552:stack/origtest/0cdfe300-f1fa-11ec-b068-121de38a7e19",
    "RequestId": "10fc583d-c908-41c1-af07-751ba3a4b563",
    "LogicalResourceId": "LBInit",
    "NoEcho": false,
    "Data": {
        "ClientErrorCode": "NoSuchEntity",
        "ClientErrorMessage": "The Server Certificate with name origtest-981587795.us-east-1.elb.amazonaws.com cannot be found."
    }
}

I have another HPC cluster active under a different name, which should not interfere with creating another cluster in the same account. The above error still appears with everything set to AUTO.

rvencu avatar Jun 22 '22 07:06 rvencu

I started to debug the issue and found that none of the previous certificates had been deleted at stack rollback or delete. I guess I hit some kind of limit, because saving the certificate did not work anymore.

After I cleaned up the old certificates, the LBInit creation succeeded.
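The cleanup can be sketched roughly as below. This is an illustration under assumptions, not the exact steps I ran: it assumes the leftovers are IAM server certificates (the `NoSuchEntity` error above mentions a Server Certificate), that `iam` is a boto3 IAM client, and that certificate names start with the cluster name as in the log (`origtest-...`). The helper names `find_stale_certificates` and `delete_certificates` are hypothetical.

```python
def find_stale_certificates(iam, active_prefixes):
    """Return names of server certificates not tied to an active cluster.

    `iam` is assumed to be a boto3 IAM client, for example
    boto3.client("iam"). `active_prefixes` lists the names of clusters
    that must be kept; matching by name prefix is an assumption.
    """
    keep = tuple(active_prefixes)
    stale = []
    paginator = iam.get_paginator("list_server_certificates")
    for page in paginator.paginate():
        for meta in page["ServerCertificateMetadataList"]:
            name = meta["ServerCertificateName"]
            if not name.startswith(keep):
                stale.append(name)
    return stale


def delete_certificates(iam, names, dry_run=True):
    """Delete the named certificates; by default only print what would go."""
    for name in names:
        if dry_run:
            print(f"Would delete {name}")
        else:
            iam.delete_server_certificate(ServerCertificateName=name)
```

Running with `dry_run=True` first and checking the list by hand is the safer order, since the prefix match could also catch certificates that have nothing to do with these clusters.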

Of course, the error on LBInit deletion still needs to be addressed.

rvencu avatar Jun 22 '22 15:06 rvencu

noted! thanks

nicolaven avatar Jul 07 '22 13:07 nicolaven