
SageMaker ResourceLimitExceeded

austinmw opened this issue 2 years ago • 9 comments

Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5 I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using SageMakerBackend before launching new ones?

Also, when using RemoteLauncher, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:

try:
    # manage tuning jobs
    ...
except Exception:
    # log and re-raise the error
    raise
finally:
    # stop any trials still running
    ...
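
For the first point, something like the following boto3 polling loop is what I mean by "fully stopped" (just a sketch; the job names would come from whatever the backend tracks):

import time

import boto3

sm = boto3.client("sagemaker")

def wait_until_stopped(job_names, poll_seconds=15):
    """Block until every given SageMaker training job has released its instance."""
    pending = set(job_names)
    while pending:
        for name in list(pending):
            status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
            if status in ("Completed", "Failed", "Stopped"):
                pending.discard(name)
        if pending:
            time.sleep(poll_seconds)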

austinmw avatar May 25 '22 23:05 austinmw

Regarding your first point, I agree that it would be good to have this functionality as an option (it should probably be opt-in, as I would assume that most of the time users do not want to wait for the instance to be released, which can take several minutes).

Regarding the second point, this behavior should already be implemented, as the tuner should stop all jobs before exiting even when an error occurs, see [here].

geoalgo avatar May 31 '22 09:05 geoalgo

I agree that the first point would allow users to better anticipate how large of a limit they need to request and the maximum number of workers they can use. I'm using 4/8 of my limit currently to make sure I don't hit a resource error.

And ah okay, I must not have realized that stop signal was being sent, thanks!

austinmw avatar May 31 '22 12:05 austinmw

Hi David, we may be able to do something about this. The issue is that once the scheduler returns STOP, a trial is marked as stopped (which is the right thing to do, because the scheduler assumes that), but then of course it may take some time for the backend to really get the resource back.

I think it would be hard for the Tuner to figure out every time how many resources (for new trials) are really available. But given that the backend's start_trial (or, more precisely, _schedule) returns a status indicating whether the new trial could really be scheduled, I think it is not hard to work out a clean solution. This would include a suitable timeout for the case where a user really does not have the required quotas, so that certain trials never get started.

mseeger avatar Jun 06 '22 15:06 mseeger

@geoalgo I can work on a solution for this one. This would allow the backend's start_trial to signal that the requested trial cannot be scheduled right now. The Tuner would then have a mechanism to try again in the next round. However, if a trial cannot be scheduled after some timeout, we'd then throw an exception.
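
Schematically, what I have in mind on the Tuner side (a sketch only; start_trial returning None when there is no free capacity is an assumed convention, not the current API):

import time

def try_to_start(backend, config, first_attempt_time, timeout=30 * 60):
    # Called once per tuner round for a suggested config that is not running yet.
    trial = backend.start_trial(config)  # assumed to return None when no capacity is free
    if trial is not None:
        return trial
    if time.time() - first_attempt_time > timeout:
        raise RuntimeError(
            "Trial could not be scheduled within the timeout; check your SageMaker instance quotas"
        )
    return None  # the tuner keeps the config and tries again in its next round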

mseeger avatar Jun 08 '22 09:06 mseeger

Hi Matthias, sorry for the delay. It would be good to discuss different options before sending a PR if possible. For instance, a simple solution could be to just throw a capacity exception from the backend's start_trial and have an option to ignore it in the tuner. However, the main difficulty is not the implementation but the testing, given that this affects only the SageMaker backend, for which we don't have integration tests yet.
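
To make that alternative concrete, a rough sketch (CapacityError and the launch method are made-up names; the real launch logic is a placeholder):

from botocore.exceptions import ClientError

class CapacityError(Exception):
    """Raised when a trial cannot be started because of missing SageMaker capacity."""

class SageMakerBackendSketch:
    def _launch_sagemaker_job(self, config):
        raise NotImplementedError  # placeholder for the real launch logic

    def start_trial(self, config):
        try:
            self._launch_sagemaker_job(config)
        except ClientError as ex:
            if ex.response["Error"]["Code"] == "ResourceLimitExceeded":
                # translate the boto3 error into something the tuner can act on
                raise CapacityError(str(ex)) from ex
            raise

The tuner would then wrap its call in a try/except and, if the ignore option is set, simply skip this round and try again later instead of propagating the CapacityError.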

geoalgo avatar Jun 08 '22 09:06 geoalgo

Hi David, but then we'd have to signal to the scheduler that the requested trial has not been started. Right now, schedulers assume that once suggest is called, a trial is started.

My feeling is this can be tested by mocking the backend.

In the end, asking users to set n_workers to much lower than their limit, just because of the brief stopping overhead, is not really good.

mseeger avatar Jun 08 '22 09:06 mseeger

> Hi David, but then we'd have to signal to the scheduler that the requested trial has not been started. Right now, schedulers assume that once suggest is called, a trial is started.

True, but we have the callback on_trial_error that is intended for this type of situation.

> In the end, asking users to set n_workers to much lower than their limit, just because of the brief stopping overhead, is not really good.

To be clear, this is not the alternative I am proposing (which is to just have an option to not propagate the capacity exception and let the tuner continue in such situations).

geoalgo avatar Jun 08 '22 12:06 geoalgo

I looked more into this:

  • Calling on_trial_error is not the right thing to do: it marks the corresponding config as invalid, and the scheduler will avoid nearby configs in the future. Our situation is not an error; the job simply cannot be scheduled right now
  • The scheduler registers the config as pending already when it is returned by suggest. This is important to make batch suggestions work, where suggest is called B times in a row

My proposal is just to decouple suggest calls (which may be done even if there are no resources right now, due to stopping being delayed) from scheduling the corresponding trials. If a trial for a certain suggested config cannot be started right now, we preferentially try it again next time. My feeling is this is cleaner than discarding the suggestion, which the scheduler has already committed to by registering the config as pending and the trial as running.
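
A rough sketch of the decoupling (the deferred queue and the simplified signatures are only illustrative):

from collections import deque

deferred = deque()  # configs the scheduler has suggested but that are not running yet

def schedule_next(scheduler, backend):
    # Prefer configs that could not be started earlier over asking for a new suggestion.
    config = deferred.popleft() if deferred else scheduler.suggest()
    trial = backend.start_trial(config)  # assumed to return None when there is no capacity
    if trial is None:
        # The scheduler already registered the config as pending; keep it and retry later.
        deferred.append(config)
    return trial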

mseeger avatar Jun 09 '22 07:06 mseeger

Ok thanks, there are multiple options and I am keen on discussing their trade-offs in terms of simplicity and ease of maintenance before implementing a specific one (it would be an impactful change, and the interactions and testing with the different backends are currently tricky).

geoalgo avatar Jun 09 '22 09:06 geoalgo

NOTE: Also requested by Jessie Luk

mseeger avatar Aug 26 '22 11:08 mseeger

This request is related to another issue with the SageMaker backend: I sometimes get ThrottlingErrors when the backend tries to start too many SageMaker training jobs very close together in time. This happens when n_workers = 16, or when several experiments are started at the same time, and typically just when the very first batch of n_workers jobs is started together.

But if we ran synchronous schedulers with the SageMaker backend, it would probably also happen.

We simply need a feature for the backend to re-try starting a trial.

mseeger avatar Sep 07 '22 09:09 mseeger

The ThrottlingError issue is resolved with #342.

It is not clear (to me) whether this fix (changing the boto3 retry policy) would also help against the ResourceLimitExceeded issue here.
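
For reference, adjusting the boto3 retry policy looks roughly like this (the exact settings may well differ from what #342 uses):

import boto3
from botocore.config import Config

# Standard retry mode retries throttling errors with exponential backoff;
# raising max_attempts gives the client more chances before giving up.
retry_config = Config(retries={"max_attempts": 10, "mode": "standard"})
sm_client = boto3.client("sagemaker", config=retry_config)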

In general, I think that just retrying for a while will not really help when the situation blocking jobs from being started persists for a while (say, minutes).

In such a case, it would be better for the backend to record the request in a list and return immediately (non-blocking), and then try again to work off the list the next time it is asked to do anything.
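
Schematically (class and method names are made up):

class DeferringBackendSketch:
    def __init__(self):
        self._to_start = []  # start requests that could not be served yet

    def start_trial(self, config):
        # Non-blocking: remember the request and return right away.
        self._to_start.append(config)
        self._work_off_list()

    def _work_off_list(self):
        # Called from every backend entry point: retry whatever is still queued.
        self._to_start = [c for c in self._to_start if not self._try_to_launch(c)]

    def _try_to_launch(self, config):
        # Placeholder: would call the real SageMaker launch and return False on
        # ResourceLimitExceeded / ThrottlingException, True on success.
        raise NotImplementedError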

mseeger avatar Sep 09 '22 12:09 mseeger

This issue remains a problem due to long stop times for SM training jobs. This will likely be overcome, for our use case here, by SageMaker warmpooling, which we will support shortly. For this reason, I am closing this issue for now, but we are aware of this, and our goal is to make sure that you can use just as many workers with the SageMaker back-end as you have quotas to run.

mseeger avatar Oct 13 '22 08:10 mseeger

Reopening, since I think I have a fix for this one.

mseeger avatar Oct 18 '22 08:10 mseeger

This is solved by #389

mseeger avatar Oct 18 '22 10:10 mseeger