
[FR] Support SageMaker multi-model endpoints

Open praateekmahajan opened this issue 4 years ago • 13 comments

Describe the proposal

MLflow currently seems to deploy each model to a new SageMaker instance. Since November 2019, SageMaker has offered multi-model endpoints, which allow users to deploy multiple models to the same instance.

One can read about multi-model endpoints here: https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/

Motivation

Multi-model endpoints can decrease costs tremendously for organizations by deploying multiple models to a single instance. SageMaker handles the rest, swapping out inactive models for models whose traffic is incoming.

Proposed Changes

Probably in the mlflow.sagemaker API?

praateekmahajan avatar Apr 20 '20 16:04 praateekmahajan

Hi!

Totally supporting this enhancement!

edgBR avatar Apr 23 '20 15:04 edgBR

Could this be achieved by setting the --mode flag to add when calling mlflow sagemaker deploy? Their docs say it will add a new model to an existing endpoint rather than create a new one.

https://www.mlflow.org/docs/latest/python_api/mlflow.sagemaker.html#mlflow.sagemaker.deploy
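
For context, a minimal sketch of what that call looks like through the Python API; the app name, model URI, and region below are placeholders, and the keyword names assume the mlflow.sagemaker.deploy signature from around that time:

import mlflow.sagemaker as mfs

# Hypothetical call: add another model to an existing application
# rather than creating a brand-new endpoint.
mfs.deploy(
    app_name="my-existing-app",               # placeholder application name
    model_uri="models:/my-model/Production",  # placeholder model URI
    mode=mfs.DEPLOYMENT_MODE_ADD,             # "add" instead of the default "create"
    region_name="us-east-1",                  # placeholder AWS region
)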

tallen94 avatar Apr 23 '20 22:04 tallen94

Hi @praateekmahajan, this sounds like a nice feature to support in the sagemaker deployment API, and we'd be more than happy to review a proposal for implementing it.

@tallen94 The current --mode add option unfortunately does not support multi-model endpoints on the same instance: each new model added to an endpoint is currently given its own instance.

dbczumar avatar Apr 27 '20 19:04 dbczumar

This is a feature that I'm very interested in having and am willing to work on. We don't use SageMaker for deployment, so it will take some time for me to research how that works, but I have been playing around with my fork, trying to get it working by running the container locally.

@dbczumar let me know what you think and if you want a more formal proposal doc or if I should just push some code once it's cleaned up

  1. To avoid changing the request data format or request routing, the model you want to predict with would be specified in a request header such as ModelURI: s3://path/to/model (see the request sketch after this list)
  2. The Flask app can use an @app.before_request handler to fetch the model and store it in the request context flask.g; we would also want caching for future requests
  3. The /invocations handler function then accesses the model stored in the request context (flask.g.model)
  4. At this point you can fetch additional metadata such as model.metadata.get_input_schema()
  5. Run the prediction
  6. Profit! (through massive cost savings 😃 )
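
As an illustration of the header-based routing in step 1, a client call might look like the sketch below; the server address, header name, and payload format are placeholders for this proposal, not an existing MLflow contract:

import requests

# Hypothetical request against a multi-model scoring server: the ModelURI
# header selects which model should handle this payload.
response = requests.post(
    "http://localhost:8080/invocations",             # placeholder server address
    headers={
        "Content-Type": "application/json",
        "ModelURI": "s3://my-bucket/path/to/model",  # placeholder model location
    },
    json={"columns": ["x"], "data": [[1.0], [2.0]]},  # pandas split-style payload
)
print(response.json())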

Right now I'm using mlflow models build-docker -m ./local/path to build the Docker image, so there's one MLmodel baked into the image that it was built from and that specifies the environment. Since that model can already be used for serving in the Docker container, I was adding an env variable MULTI_MODEL_SERVER that determines whether the request context is set to the model baked into the image or to the model named in the request header.

Q: Does pyfunc.load_pyfunc provide sufficient checks for whether the conda env matches? Or what else would need to be done here? Are there existing utils for comparing the incoming model's environment spec to the one stored in the image?

sample (incomplete, untested) code:

Environment variable dictating which run mode to use and setting the model

import os

import flask
from mlflow import pyfunc
from mlflow.pyfunc import PyFuncModel


def init(model: PyFuncModel):
    """
    Initialize the server. Loads the pyfunc model from the path.
    """
    app = flask.Flask(__name__)
    # Environment variables are strings, so parse the flag explicitly.
    MULTI_MODEL_SERVER = os.environ.get("MULTI_MODEL_SERVER", "false").lower() in ("true", "1")

    @app.before_request
    def before_request():
        if MULTI_MODEL_SERVER:
            # Headers live on flask.request; flask.g uses attribute access.
            model_uri = flask.request.headers.get("ModelURI", type=str)
            flask.g.model = pyfunc.load_pyfunc(model_uri)
        else:
            flask.g.model = model
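
To cover the caching mentioned in step 2, a minimal sketch could memoize the loader; this assumes a model at a given URI is immutable, and load_cached_model is a hypothetical helper rather than an existing MLflow API:

from functools import lru_cache

@lru_cache(maxsize=8)
def load_cached_model(model_uri: str) -> PyFuncModel:
    # Keep recently used models in memory so repeated requests for the same
    # ModelURI do not re-download and re-deserialize the artifact.
    return pyfunc.load_pyfunc(model_uri)

before_request would then set flask.g.model = load_cached_model(model_uri); for large models an eviction policy based on memory usage rather than entry count would probably be needed.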

performing predictions

    @app.route("/invocations", methods=["POST"])
    @catch_mlflow_exception
    def transformation():  # pylint: disable=unused-variable
        """
        Do an inference on a single batch of data. In this sample server,
        we take data as CSV or json, convert it to a Pandas DataFrame or Numpy,
        generate predictions and convert them back to json.
        """
        current_model = flask.g.model
        input_schema = current_model.metadata.get_input_schema()
        ....
        try:
            raw_predictions = current_model.predict(data)

MarkAWard avatar Feb 19 '21 20:02 MarkAWard

Using before_request seems unnecessary and just complicates the logic by requiring a check on which route was requested. It should be fine to simplify and do something like:

    @app.route("/invocations", methods=["POST"])
    @catch_mlflow_exception
    def transformation():  # pylint: disable=unused-variable
        """
        Do an inference on a single batch of data. In this sample server,
        we take data as CSV or json, convert it to a Pandas DataFrame or Numpy,
        generate predictions and convert them back to json.
        """
        if MULTI_MODEL_SERVER:
            model_uri = flask.request.headers.get("ModelURI", type=str)
            try:
                current_model = load_model(model_uri)
            except Exception as e:
                json_err = {
                    "error_code": "RESOURCE_DOES_NOT_EXIST",
                    "message": str(e)
                }
                raise RestException(json_err) from e
        else:
            current_model = model
        input_schema = current_model.metadata.get_input_schema()

I could use guidance on how we want to do error handling/raising, and on any other model checks that should be performed (or should we just trust the user?).

MarkAWard avatar Feb 22 '21 05:02 MarkAWard

The before_request handler might still be useful, maybe not per-request but in the outer init function, to preload models into the multi-model container, perhaps using an env var with comma-separated URIs.

AndersonReyes avatar Feb 22 '21 15:02 AndersonReyes

Looking out for the multi-model endpoints feature in MLflow SageMaker. Any update on this enhancement?

litty-tt avatar Jun 04 '21 10:06 litty-tt

I would also be interested in this. I am wondering if we can do a workaround with the regular deploy function in mlflow.sagemaker? There is a mode=mlflow.sagemaker.DEPLOY_ADD or something like that; not sure if that will be "shadow" or multi-model though.

adamwrobel-ext-gd avatar Jan 23 '23 14:01 adamwrobel-ext-gd

👍 I am also interested in this

rgangopadhya avatar Jun 20 '23 22:06 rgangopadhya

Any updates on this?

Leothi avatar Dec 01 '23 12:12 Leothi

Any updates on this?

Seems like not. It would be interesting to start developing this feature, given that multi-model endpoints are being used more and more often.

rafzenx avatar Dec 23 '23 00:12 rafzenx

This is really a blocker for us in continuing to adopt MLflow for deployment. We have a lot of use cases with models that are seldom used. If we have to pay for one endpoint per model it would be way too costly, so we want to pool them through multi-model endpoints. Not being able to do that is a blocker for adoption.

@dbczumar it seems you were happy to review a proposal about this in 2020, but I don't see any feedback on @MarkAWard's proposal in 2021. Is there still strong interest on your side? Is it worth investing time to look into this?

cjolif avatar Jan 30 '24 10:01 cjolif

@cjolif I eventually managed to do a workaround. For each model included in the multi-model endpoint, you need to download the model binary from the MLflow registry (only those files, nothing else can be there), zip it again into model.tar.gz, and then you can directly use the Airflow SageMakerEndpointOperator. Just provide the config for the operator:

  • model config - give it a name and the ensemble zip file; you can assign model tags (mentioning which MLflow models are included), and also specify "Mode": "MultiModel"
  • endpoint config name
  • endpoint name

Then you need to update the MLflow tags for all the involved models to point to this SageMaker model and endpoint.

So instead of just using the mlflow.sagemaker functions, you need to retrieve the model from MLflow, assemble the zip file yourself, use the SageMaker SDK (or Airflow operators) to create the entities in SageMaker, do the deployment, and manage metadata like tags yourself as well. But that is really it; it works.
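
For readers not using Airflow, a rough boto3/MLflow sketch of the same flow; all bucket, role, model, endpoint names, and tag keys are placeholders, and the serving container image is an assumption you would need to swap for something that implements SageMaker's multi-model container contract:

import tarfile

import boto3
import mlflow.artifacts
from mlflow.tracking import MlflowClient

# 1. Pull only the model files from the MLflow registry and repackage them.
local_dir = mlflow.artifacts.download_artifacts(
    artifact_uri="models:/my-model/1",  # placeholder registered model/version
    dst_path="./my-model",
)
with tarfile.open("my-model.tar.gz", "w:gz") as tar:
    tar.add(local_dir, arcname=".")

# 2. Upload the archive under the common S3 prefix shared by all models on the endpoint.
boto3.client("s3").upload_file("my-model.tar.gz", "my-bucket", "mme/my-model.tar.gz")

# 3. Create the multi-model SageMaker model, endpoint config, and endpoint.
sm = boto3.client("sagemaker")
sm.create_model(
    ModelName="my-multi-model",
    PrimaryContainer={
        "Image": "<serving-container-image-uri>",  # placeholder inference image
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://my-bucket/mme/",     # prefix holding all model.tar.gz files
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/my-sagemaker-role",  # placeholder
)
sm.create_endpoint_config(
    EndpointConfigName="my-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-multi-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="my-mme-endpoint", EndpointConfigName="my-mme-config")

# 4. Record the mapping back in MLflow as model version tags (hypothetical tag keys).
client = MlflowClient()
client.set_model_version_tag("my-model", "1", "sagemaker_model", "my-multi-model")
client.set_model_version_tag("my-model", "1", "sagemaker_endpoint", "my-mme-endpoint")

Individual models are then invoked with the TargetModel parameter of invoke_endpoint, passing the archive name (for example my-model.tar.gz).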

adamwrobel-ext-gd avatar Jan 30 '24 14:01 adamwrobel-ext-gd