model_fn and input_fn called multiple times
I am using the prebuilt SageMaker SKLearn container (https://github.com/aws/sagemaker-scikit-learn-container), version 0.20.0. In the entry_point, I include a script that carries out the batch transform job.
import time


def model_fn(model_dir):
    # Load the trained model from model_dir and return it
    ...

def input_fn(input_data, content_type):
    # Deserialize the request payload
    ...

def predict_fn(input_data, model):
    '''
    A long-running process to preprocess the data before calling the model
    https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
    '''
    time.sleep(60 * 11)  # sleep for 11 mins to simulate a long-running process
    ...

def output_fn(prediction, accept):
    # Serialize the prediction into the response payload
    ...
I noticed in the CloudWatch logs that model_fn() was called multiple times:
21:11:43 model_fn called /opt/ml/model 0.3710819465747405
21:11:43 model_fn called /opt/ml/model 0.1368146211634631
21:11:44 model_fn called /opt/ml/model 0.09153953459183728
The input_fn() was also called multiple times
20:41:31 input_data <class 'str'> application/json 0.3936440317990033 {
20:51:30 input_data <class 'str'> application/json 0.4852180186010707 {
21:01:30 input_data <class 'str'> application/json 0.9954036507047136 {
21:11:30 input_data <class 'str'> application/json 0.0806271844985188 {
More precisely, it's called every 10 minutes.
I used ml.m4.xlarge, BatchStrategy = SingleRecord, and SplitType of None. I also set the environment variable SAGEMAKER_MODEL_SERVER_TIMEOUT = '9999' to overcome the 60s timeout. I expected model_fn and input_fn to be called only once, but in this case they were called multiple times. In the end, the container crashed with "Internal Server Error".
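For context, here is a minimal sketch of how a transform job with these settings can be set up via the SageMaker Python SDK (the script name, role, and S3 paths below are placeholders, not my actual values):

from sagemaker.sklearn.model import SKLearnModel

# Placeholders: substitute your own model artifact, IAM role, script, and S3 paths
sklearn_model = SKLearnModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    entry_point="batch_transform_entry.py",
    framework_version="0.20.0",
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "9999"},  # raise the 60s default timeout
)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="SingleRecord",
)

transformer.transform(
    data="s3://my-bucket/input/",
    content_type="application/json",
    split_type=None,
)
transformer.wait()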
I saw a similar related issue before (https://github.com/awslabs/amazon-sagemaker-examples/issues/341) where model_fn was called on each invocation. But in this case there is no /invocations request; model_fn, input_fn, predict_fn, and output_fn were called multiple times, and in the end the container crashed with "Internal Server Error".
How did you resolve this, please? I am getting the same issue.
Same issue here =/
Same issue here. If model_fn provides the functionality for loading the model, do we need to load it for every batch?
Same issue here! Has anyone found a solution to this?
How was this issue solved? Same issue here too.
Has anyone found a solution? I'm facing the same issue: the function runs 4 times, seemingly once per available GPU.
Can you show your code? I would like to reproduce it.
Is there any update on this? It seems there's a problem with sagemaker-inference-toolkit; sagemaker-huggingface-inference-toolkit has the same issue: https://github.com/aws/sagemaker-huggingface-inference-toolkit/issues/133
In general, as far as I'm aware, it's expected that model_fn will be called multiple times, because the default behaviour is for the server to load multiple copies of your model and use them to serve concurrent requests on multiple worker threads.
I've worked pretty closely with SageMaker but am not part of their core inference engineering team, so the following is based on an imperfect (and potentially outdated) understanding:
I believe both the sagemaker-scikit-learn-container and sagemaker-huggingface-inference-toolkit (for the Hugging Face DLCs) use the AWS Labs multi-model-server (MMS) as their base inference server. The core sagemaker-inference-toolkit depends on it too, as mentioned in its README, but I know other DLCs like PyTorch and TensorFlow have been using their own ecosystems' serving stacks, TorchServe and TFX.
It does make sense for the stack to support multiple worker threads so you can effectively utilize resources like multi-GPU instances or a large number of CPU cores, and in general the stack should be configurable. But (IMO) it's a bit difficult to navigate, with the serving stacks for these containers being split across so many different layers of code repositories...
To explicitly control/limit the number of worker threads created to best utilize the hardware, I'd suggest trying environment variables:
- SAGEMAKER_MODEL_SERVER_WORKERS (as per the SM Inference Toolkit parameters.py)
- MMS_DEFAULT_WORKERS_PER_MODEL, MMS_NETTY_CLIENT_THREADS, and possibly also MMS_NUMBER_OF_NETTY_THREADS (as per the MMS configuration doc and underlying ConfigManager)
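As a rough sketch (the names and values below are illustrative, not a verified recipe), these can be passed as container environment variables, e.g. through the env argument of the SDK's Model classes or the Environment field of CreateModel:

# Illustrative values: limit the serving stack to a single model worker
worker_env = {
    "SAGEMAKER_MODEL_SERVER_WORKERS": "1",  # SM Inference Toolkit worker count
    "MMS_DEFAULT_WORKERS_PER_MODEL": "1",   # MMS workers spawned per model
    "MMS_NUMBER_OF_NETTY_THREADS": "1",     # MMS frontend (Netty) threads
    "MMS_NETTY_CLIENT_THREADS": "1",        # MMS backend client threads
}

# e.g. SKLearnModel(..., env=worker_env), or with boto3:
# sagemaker_client.create_model(
#     ModelName="my-model",
#     PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data, "Environment": worker_env},
#     ExecutionRoleArn=role,
# )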
input_fn being called multiple times for a single request is more concerning, as that looks like a retry. You may have to set MMS-specific timeout & payload size configurations if the SAGEMAKER_ one isn't getting picked up. For example, in the past for large-payload/long-running inference on the Hugging Face v4.28 container, I used MMS_DEFAULT_RESPONSE_TIMEOUT, MMS_MAX_REQUEST_SIZE, and MMS_MAX_RESPONSE_SIZE.
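A sketch of how those might be set (the numbers are illustrative, not the values I actually used; if memory serves, the MMS timeout is in seconds and the sizes are in bytes):

# Illustrative values for a long-running, large-payload job
timeout_env = {
    "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",         # SM Inference Toolkit timeout
    "MMS_DEFAULT_RESPONSE_TIMEOUT": "3600",           # MMS backend response timeout
    "MMS_MAX_REQUEST_SIZE": str(100 * 1024 * 1024),   # ~100 MB request payload
    "MMS_MAX_RESPONSE_SIZE": str(100 * 1024 * 1024),  # ~100 MB response payload
}

# Passed the same way as the worker settings above, e.g. SKLearnModel(..., env=timeout_env)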
Hope this helps, but it'd be great to hear from anybody who manages to clarify exactly which env vars are sufficient to control the number of model workers spawned on these containers.
Thanks a lot @athewsey. Your points are very useful.