
MaxSpotInstanceCountExceeded in GL tests due to GPU

Open DavidGOrtega opened this issue 4 years ago • 4 comments

We're in a difficult scenario where the GL tests keep hitting MaxSpotInstanceCountExceeded errors. Using --reuse will probably solve this.

DavidGOrtega avatar Jul 27 '21 15:07 DavidGOrtega

If we enable the --reuse option, unit tests won't [always] cover the runner creation process. Do we want that?

0x2b3bfa0 avatar Jul 28 '21 20:07 0x2b3bfa0

@shcheklein found that there are multiple instances running for about a week. We need:

  • [ ] automatic warning (email etc.) from AWS if instances run for more than e.g. 30min
  • [ ] automatic shutdown of instances (timeout) by AWS
  • [ ] figure out why CML didn't cleanly terminate the instances
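
A minimal sketch of the check behind the first two items, assuming instance records shaped like the entries boto3's `describe_instances` returns (an `InstanceId` string and a timezone-aware `LaunchTime`). The helper name `find_overdue` and the sample IDs are hypothetical; the 30 min threshold is the one suggested above. This is not how CML or AWS does it, just one way the rule could look:

```python
from datetime import datetime, timedelta, timezone


def find_overdue(instances, max_age=timedelta(minutes=30), now=None):
    """Return the IDs of instances that have been up longer than max_age.

    `instances` is an iterable of dicts with "InstanceId" and a
    timezone-aware "LaunchTime" (assumed record shape, see lead-in).
    """
    now = now or datetime.now(timezone.utc)
    return [
        inst["InstanceId"]
        for inst in instances
        if now - inst["LaunchTime"] > max_age
    ]


if __name__ == "__main__":
    # Fixed "now" so the example is reproducible.
    now = datetime(2021, 8, 2, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"InstanceId": "i-old", "LaunchTime": now - timedelta(days=7)},
        {"InstanceId": "i-new", "LaunchTime": now - timedelta(minutes=5)},
    ]
    print(find_overdue(sample, now=now))  # ['i-old']
```

On a schedule, the returned IDs could feed a notification (first item) or a terminate call (second item). Wiring that up to real AWS credentials and deciding which instances are in scope is left out here.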

--reuse will only hide the problem without solving it

casperdcl avatar Aug 02 '21 11:08 casperdcl

related to #680

This might be related to #678: after seeing the logs, it seems that the chrono is not working properly.

DavidGOrtega avatar Aug 02 '21 12:08 DavidGOrtega

also https://github.com/cloud-custodian/cloud-custodian (@dberenbaum suggestion)

casperdcl avatar Oct 04 '21 12:10 casperdcl

We haven't seen this in a while, closing for now.

dacbd avatar Feb 17 '23 15:02 dacbd