SageMaker ResourceLimitExceeded
Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set `Tuner.n_workers = 5`, I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using `SageMakerBackend` before launching new ones?
Also, when using `RemoteLauncher`, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:
```python
try:
    ...  # manage tuning jobs
except Exception:
    raise  # propagate the error
finally:
    ...  # stop any trials still running
```
Regarding your first point, I agree that it would be good to have this functionality as an option (it should probably be an option, as I would assume that most of the time users do not want to wait for the instance to be released, which can take several minutes).
Regarding the second point, this behavior should already be implemented: the tuner should stop all jobs before exiting, even when an error occurs, see [here].
I agree that the first point would allow users to better anticipate how large a limit they need to request and the maximum number of workers they can use. I'm currently using 4 of my limit of 8 to make sure I don't hit a resource error.
Ah okay, I must not have realized that a stop signal was being sent, thanks!
Hi David, we may be able to do something about this. The issue is that once the scheduler returns STOP, a trial is marked as stopped (which is the right thing to do, because the scheduler assumes that), but then of course it may take some time for the backend to really get the resource back.
I think it would be hard for the Tuner to figure out every time how many resources (for new trials) are really available. But given that the backend's `start_trial` (or, more precisely, `_schedule`) returns a status indicating whether the new trial could really be scheduled, I think it is not hard to work out a clean solution. This would include a suitable timeout for the case where a user really does not have the required quotas, so that trials which can never be started do not block the tuner forever.
@geoalgo
I can work on a solution for this one. This would allow the backend's `start_trial` to signal that the requested trial cannot be scheduled right now. The Tuner would then have a mechanism to try again in the next round. However, if a trial cannot be scheduled after some timeout, we'd then throw an exception.
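To make that concrete, here is a minimal sketch of what the retry loop in the tuner could look like. All names here (`TrialCannotBeScheduled`, `pending_starts`, `start_timeout`) are hypothetical and only illustrate the control flow, they are not part of the current API:

```python
import time

# Hypothetical signal from the backend that there is currently no capacity.
class TrialCannotBeScheduled(Exception):
    pass


def try_to_start_pending(backend, pending_starts, start_timeout):
    """Attempt to start trials that were suggested but not yet scheduled.

    ``pending_starts`` maps trial_id -> (config, time_of_first_attempt).
    Trials that still cannot be scheduled stay in the dict and are retried
    in the next tuner iteration; if one has been waiting for longer than
    ``start_timeout`` seconds, we give up with an explicit error.
    """
    for trial_id, (config, first_attempt) in list(pending_starts.items()):
        try:
            # Simplification: in reality the backend call and bookkeeping
            # would be more involved than a plain start_trial here.
            backend.start_trial(config=config)
            del pending_starts[trial_id]  # scheduled successfully
        except TrialCannotBeScheduled:
            if time.time() - first_attempt > start_timeout:
                raise RuntimeError(
                    f"Trial {trial_id} could not be scheduled within "
                    f"{start_timeout} seconds; check your instance quotas."
                )
            # No capacity yet: keep it in pending_starts for the next round.
```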
Hi Matthias, sorry for the delay. It would be good to discuss different options before sending a PR if possible. For instance, a simple solution could be to just throw a capacity exception from the backend's `start_trial` and have an option to ignore it in the tuner. However, the main difficulty is not the implementation but the testing, given that this affects only the SageMaker backend, for which we don't have integration tests yet.
Hi David, but then we'd have to signal to the scheduler that the requested trial has not been started. Right now, schedulers assume that once `suggest` is called, a trial is started.
My feeling is this can be tested by mocking the backend.
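For example (purely a sketch, reusing the hypothetical `TrialCannotBeScheduled` signal from above), a test could wrap the local backend so that the first few `start_trial` calls fail with a simulated capacity error, and then assert that the tuner still starts every trial and finishes without an exception:

```python
class FlakyCapacityBackend:
    """Test helper: delegates to a real backend (e.g. the local one), but
    simulates capacity errors for the first ``num_failures`` start_trial calls.
    """

    def __init__(self, inner_backend, num_failures=3):
        self._inner = inner_backend
        self._remaining_failures = num_failures

    def start_trial(self, *args, **kwargs):
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise TrialCannotBeScheduled("simulated ResourceLimitExceeded")
        return self._inner.start_trial(*args, **kwargs)

    def __getattr__(self, name):
        # Everything else behaves exactly like the wrapped backend.
        return getattr(self._inner, name)
```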
In the end, asking users to set `n_workers` to much lower than their limit, just because of the brief stopping overhead, is not really good.
> Hi David, but then we'd have to signal to the scheduler that the requested trial has not been started. Right now, schedulers assume that once `suggest` is called, a trial is started.
True, but we have the callback `on_trial_error` that is intended for this type of situation.
> In the end, asking users to set `n_workers` to much lower than their limit, just because of the brief stopping overhead, is not really good.
To be clear, this is not the alternative I am proposing (which is simply to have an option to not propagate the capacity exception and let the tuner continue in such situations).
I looked more into this:
- Calling `on_trial_error` is not the right thing to do: it marks the corresponding config as invalid, and the scheduler should then avoid nearby configs in the future. Our situation is not an error, the job just cannot be scheduled right now.
- The scheduler registers the config as pending already when it is returned by `suggest`. This is important to make batch suggestions work, where `suggest` is called B times in a row.
My proposal is just to decouple `suggest` calls (which may be done even if there are no resources right now, due to stopping being delayed) from scheduling the corresponding trials. If a trial for a certain suggested config cannot be started right now, it is preferentially tried next time. My feeling is this is cleaner than discarding the suggestion, which the scheduler has already committed to (registered the config as pending, registered the trial as running).
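As an illustration of this decoupling (a sketch only, with hypothetical helper names such as `can_start_trial` and `scheduler_suggest`), the tuner could keep a FIFO of suggestions whose trials have not been started yet and work it off before asking the scheduler for anything new:

```python
from collections import deque

# Suggestions the scheduler has already committed to (registered as pending),
# but whose trials could not be started yet because of missing capacity.
suggested_not_started = deque()


def fill_free_slots(backend, scheduler_suggest, can_start_trial, num_free_slots):
    started = 0
    # 1) Preferentially retry earlier suggestions that could not be started.
    while suggested_not_started and started < num_free_slots:
        if not can_start_trial():
            return started  # still no capacity, try again next round
        config = suggested_not_started.popleft()
        backend.start_trial(config=config)
        started += 1
    # 2) Only then ask the scheduler for fresh suggestions.
    while started < num_free_slots:
        config = scheduler_suggest()
        if can_start_trial():
            backend.start_trial(config=config)
            started += 1
        else:
            suggested_not_started.append(config)
            break
    return started
```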
Ok thanks, there are multiple options and I am keen on discussing their trade-offs in terms of simplicity and ease of maintenance before implementing a specific one (it would be an impactful change, and the interactions and testing with different backends are currently tricky).
NOTE: Also requested by Jessie Luk
This request is related to another issue with the SageMaker backend: I sometimes get ThrottlingErrors when the backend tries to start too many SM training jobs very close together in time. This happens when `n_workers = 16`, or when several experiments are started at the same time. It happens just when the very first batch of `n_workers` jobs is started together.
But if we ran synchronous schedulers with SM backend, it would probably also happen.
We simply need a feature for the backend to re-try starting a trial.
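A simple version of such a retry, with exponential backoff and jitter around whatever call creates the SM training job, could look like this (a sketch only; in practice one would only retry on the specific throttling error, e.g. a botocore ClientError with a throttling error code, and re-raise everything else):

```python
import random
import time


def start_with_retries(start_fn, max_attempts=5, base_delay=1.0):
    """Call ``start_fn`` (the code creating the SageMaker training job),
    retrying with exponential backoff plus jitter when it fails."""
    for attempt in range(max_attempts):
        try:
            return start_fn()
        except Exception:
            # Last attempt: give up and propagate the error.
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0.0, 1.0))
```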
The ThrottlingError issue is resolved with #342.
It is not clear (to me) whether this fix (changing the boto3 retry policy) would also help against the ResourceLimitExceeded issue here.
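For reference (this is standard boto3 configuration and not necessarily the exact change made in #342), the retry policy of the SageMaker client can be adjusted like this:

```python
import boto3
import sagemaker
from botocore.config import Config

# "adaptive" mode adds client-side rate limiting on top of the usual
# exponential backoff, and max_attempts raises the retry budget.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
sm_client = boto3.client("sagemaker", config=retry_config)
sm_session = sagemaker.Session(sagemaker_client=sm_client)
```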
In general, I think that just retrying for a while will not really help when whatever is blocking jobs from being started remains in place for a while (say, minutes). In such a case, it would be better for the backend to log the request in a list and return (non-blocking), and, when asked to do the next thing, try again to work off the list.
This issue remains a problem due to long stop times for SM training jobs. For our use case here, this will likely be overcome by SageMaker warm pools, which we will support shortly. For this reason, I am closing this issue for now, but we are aware of it, and our goal is to make sure that you can use just as many workers with the SageMaker backend as your quotas allow.
Reopening, since I think I have a fix for this one.
This is solved by #389