model_fn and input_fn called multiple times
I am using the prebuilt SageMaker SKLearn container (https://github.com/aws/sagemaker-scikit-learn-container), version 0.20.0. In the entry_point, I include a script that carries out the batch transform job.
import time


def model_fn(model_dir):
    # Load the trained model from model_dir and return it
    ...

def input_fn(input_data, content_type):
    # Deserialize the request payload
    ...

def predict_fn(input_data, model):
    '''
    A long-running process to preprocess the data before calling the model
    https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
    '''
    time.sleep(60 * 11)  # sleep for 11 mins to simulate a long-running process
    ...

def output_fn(prediction, accept):
    # Serialize the prediction into the response payload
    ...
I noticed in the CloudWatch logs that model_fn() was called multiple times:
21:11:43 model_fn called /opt/ml/model 0.3710819465747405
21:11:43 model_fn called /opt/ml/model 0.1368146211634631
21:11:44 model_fn called /opt/ml/model 0.09153953459183728
The input_fn() was also called multiple times
20:41:31 input_data <class 'str'> application/json 0.3936440317990033 {
20:51:30 input_data <class 'str'> application/json 0.4852180186010707 {
21:01:30 input_data <class 'str'> application/json 0.9954036507047136 {
21:11:30 input_data <class 'str'> application/json 0.0806271844985188 {
More precisely, it's called every 10 minutes.
I used ml.m4.xlarge, BatchStrategy = SingleRecord, and SplitType of None. I also set the environment variable SAGEMAKER_MODEL_SERVER_TIMEOUT = '9999' to overcome the 60s timeout. I expected model_fn and input_fn to be called only once, but in this case they were called multiple times. In the end, the container crashed with "Internal Server Error".
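For context, here is a minimal sketch of how a transform job with these settings can be set up via the SageMaker Python SDK (the script name, role, and S3 paths below are placeholders, not my actual values):

from sagemaker.sklearn.model import SKLearnModel

# Placeholders: substitute your own model artifact, IAM role, script, and S3 paths
sklearn_model = SKLearnModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    entry_point="batch_transform_entry.py",
    framework_version="0.20.0",
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "9999"},  # raise the 60s default timeout
)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="SingleRecord",
)

transformer.transform(
    data="s3://my-bucket/input/",
    content_type="application/json",
    split_type=None,
)
transformer.wait()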
I saw a similar related issue before (https://github.com/awslabs/amazon-sagemaker-examples/issues/341) where model_fn was called on each invocation. But in this case there is no /invocations request; model_fn, input_fn, predict_fn, and output_fn were called multiple times, and in the end the container crashed with "Internal Server Error".
How did you resolve this, please? I am getting the same issue.
Same issue here =/
Same issue here. If model_fn provides the functionality for loading the model, do we need to load it for every batch?
Same issue here! Has anyone found a solution to this?
How was this issue solved? Same issue here too.
Has anyone found a solution? I'm facing the same issue: the function runs 4 times, seemingly once per available GPU.
Can you show your code? I would like to reproduce it.
Is there any update on this? It seems there's a problem with sagemaker-inference-toolkit; sagemaker-huggingface-inference-toolkit has the same issue: https://github.com/aws/sagemaker-huggingface-inference-toolkit/issues/133
In general, as far as I'm aware, it's expected that model_fn will be called multiple times, because the default behaviour is for the server to load multiple copies of your model and use them to serve concurrent requests on multiple worker threads.
I've worked pretty closely with SageMaker but am not part of their core inference engineering team, so the following is based on an imperfect (and potentially outdated) understanding:
I believe both the sagemaker-scikit-learn-container and sagemaker-huggingface-inference-toolkit (for the Hugging Face DLCs) use the AWS Labs multi-model-server (MMS) as their base inference server. The core sagemaker-inference-toolkit depends on it too, as mentioned in its README, but I know other DLCs like PyTorch and TensorFlow have been using their own ecosystems' serving stacks, TorchServe and TFX.
It does make sense for the stack to support multiple worker threads so you can effectively utilize resources like multi-GPU instances or a large number of CPU cores, and in general the stack should be configurable. But (IMO) it's a bit difficult to navigate, with the serving stacks for these containers being split across so many different layers of code repositories...
To explicitly control/limit the number of worker threads created to best utilize the hardware, I'd suggest trying environment variables:
- SAGEMAKER_MODEL_SERVER_WORKERS (as per the SM Inference Toolkit parameters.py)
- MMS_DEFAULT_WORKERS_PER_MODEL, MMS_NETTY_CLIENT_THREADS, and possibly also MMS_NUMBER_OF_NETTY_THREADS (as per the MMS configuration doc and underlying ConfigManager)
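As a rough sketch (the names and values below are illustrative, not a verified recipe), these can be passed as container environment variables, e.g. through the env argument of the SDK's Model classes or the Environment field of CreateModel:

# Illustrative values: limit the serving stack to a single model worker
worker_env = {
    "SAGEMAKER_MODEL_SERVER_WORKERS": "1",  # SM Inference Toolkit worker count
    "MMS_DEFAULT_WORKERS_PER_MODEL": "1",   # MMS workers spawned per model
    "MMS_NUMBER_OF_NETTY_THREADS": "1",     # MMS frontend (Netty) threads
    "MMS_NETTY_CLIENT_THREADS": "1",        # MMS backend client threads
}

# e.g. SKLearnModel(..., env=worker_env), or with boto3:
# sagemaker_client.create_model(
#     ModelName="my-model",
#     PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data, "Environment": worker_env},
#     ExecutionRoleArn=role,
# )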
input_fn being called multiple times for a single request is more concerning, as that looks like a retry. You may have to set MMS-specific timeout & payload size configurations if the SAGEMAKER_ one isn't getting picked up. For example, in the past for large-payload/long-running inference on the Hugging Face v4.28 container, I used MMS_DEFAULT_RESPONSE_TIMEOUT, MMS_MAX_REQUEST_SIZE, and MMS_MAX_RESPONSE_SIZE.
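A sketch of how those might be set (the numbers are illustrative, not the values I actually used; if memory serves, the MMS timeout is in seconds and the sizes are in bytes):

# Illustrative values for a long-running, large-payload job
timeout_env = {
    "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",         # SM Inference Toolkit timeout
    "MMS_DEFAULT_RESPONSE_TIMEOUT": "3600",           # MMS backend response timeout
    "MMS_MAX_REQUEST_SIZE": str(100 * 1024 * 1024),   # ~100 MB request payload
    "MMS_MAX_RESPONSE_SIZE": str(100 * 1024 * 1024),  # ~100 MB response payload
}

# Passed the same way as the worker settings above, e.g. SKLearnModel(..., env=timeout_env)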
Hope this helps, but it'd be great to hear from anybody who manages to clarify exactly which env vars are sufficient to control the number of model workers spawned on these containers.
Thanks a lot @athewsey. Your points are very useful.