[FR] Support SageMaker multi-model endpoints
Describe the proposal
MLflow currently deploys each model to a new SageMaker instance. Since November 2019, SageMaker has offered something called multi-model endpoints, which allow users to deploy multiple models to the same instance.
You can read about multi-model endpoints here: https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/
Motivation
Multi-model endpoints can decrease costs tremendously for organizations by deploying multiple models to a single instance. SageMaker handles the rest, swapping out inactive models in favor of models that are receiving traffic.
Proposed Changes
Probably in the mlflow.sagemaker API?
Hi!
Totally supporting this enhancement!
Could this be achieved by setting the --mode flag to add when calling mlflow sagemaker deploy? Their docs say it will add a new model to an existing endpoint rather than create one.
https://www.mlflow.org/docs/latest/python_api/mlflow.sagemaker.html#mlflow.sagemaker.deploy
Hi @praateekmahajan, this sounds like a nice feature to support in the sagemaker deployment API and we'd be more than happy to review a proposal for implementing it.
@tallen94 The current --mode add option unfortunately does not support multi-model endpoints on the same instance: each new model that is added to an endpoint is currently allocated its own instance.
This is a feature that I'm very interested in having and willing to work on. We don't use SageMaker for deployment, so it will take some time for me to research how that works, but I have been playing around with my fork trying to get it working by running the container locally.
@dbczumar let me know what you think and whether you want a more formal proposal doc or if I should just push some code once it's cleaned up.
- To avoid changing the request data format or request routing, the model you want to predict with would be specified in a request header such as ModelURI: s3://path/to/model (see the sketch after this list).
- The Flask app can use an @app.before_request handler to fetch the model and store it in the request context flask.g; we would want to add caching as well for future requests.
- The /invocations handler function then accesses the model stored in the request context (flask.g).
- At this point you can fetch additional metadata such as model.metadata.get_input_schema().
- Run the prediction.
- Profit! (through massive cost savings 😃)
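A minimal sketch of what a client call against such a server could look like. The host/port, the ModelURI header, and the payload shape are illustrative assumptions based on the proposal above, not an existing MLflow API:

import requests

# Hypothetical endpoint; host/port depend on how the container is run.
# The ModelURI header is the proposed per-request model selector.
response = requests.post(
    "http://localhost:8080/invocations",
    headers={
        "ModelURI": "s3://my-bucket/path/to/model",
        "Content-Type": "application/json",
    },
    # Payload format follows whatever the pyfunc scoring server expects;
    # a pandas split-orient frame is shown here purely as an example.
    json={"columns": ["x"], "data": [[1.0], [2.0]]},
)
print(response.json())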
Right now I'm using mlflow models build-docker -m ./local/path to build the Docker image, so there's one MLmodel baked in there that the image is built from and that specifies the environment. Since that model can already be used for serving in the Docker container, I was adding an env variable MULTI_MODEL_SERVER that would determine whether the request context sets the model to the one baked into the image or checks the request header for the model.
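For local testing, something along these lines could toggle that behaviour (the image name is a placeholder, and MULTI_MODEL_SERVER is the proposed flag from this comment, not an existing MLflow option):

# Build an image with one model baked in, then run it with the proposed
# MULTI_MODEL_SERVER flag so /invocations consults the ModelURI header.
mlflow models build-docker -m ./local/path -n my-mme-image
docker run -p 5001:8080 -e MULTI_MODEL_SERVER=true my-mme-image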
Q: Does pyfunc.load_pyfunc provide sufficient checks for whether the conda env matches? Or what else would need to be done here? Are there existing utils to compare the incoming model's environment spec to the one stored in the image?
Sample (incomplete, untested) code:
Environment variable dictating which run mode to use and setting the model
import os

import flask
from mlflow import pyfunc
from mlflow.pyfunc import PyFuncModel


def init(model: PyFuncModel):
    """
    Initialize the server. Loads the pyfunc model from the path.
    """
    app = flask.Flask(__name__)
    # Note: env values are strings, so any set value enables multi-model mode.
    MULTI_MODEL_SERVER = os.environ.get("MULTI_MODEL_SERVER", False)

    @app.before_request
    def before_request():
        if MULTI_MODEL_SERVER:
            # Resolve the model per request from the proposed ModelURI header.
            model_uri = flask.request.headers.get("ModelURI", type=str)
            flask.g.model = pyfunc.load_model(model_uri)
        else:
            # Fall back to the model baked into the image.
            flask.g.model = model
performing predictions
    @app.route("/invocations", methods=["POST"])
    @catch_mlflow_exception
    def transformation():  # pylint: disable=unused-variable
        """
        Do an inference on a single batch of data. In this sample server,
        we take data as CSV or JSON, convert it to a Pandas DataFrame or Numpy,
        generate predictions and convert them back to JSON.
        """
        # Model was resolved by the before_request handler above.
        current_model = flask.g.model
        input_schema = current_model.metadata.get_input_schema()
        # ... parse the request payload into `data` ...
        try:
            raw_predictions = current_model.predict(data)
Using before_request seems unnecessary and just complicates the logic by requiring a check on which route is being requested. It should be fine to simplify and do something like:
    @app.route("/invocations", methods=["POST"])
    @catch_mlflow_exception
    def transformation():  # pylint: disable=unused-variable
        """
        Do an inference on a single batch of data. In this sample server,
        we take data as CSV or JSON, convert it to a Pandas DataFrame or Numpy,
        generate predictions and convert them back to JSON.
        """
        if MULTI_MODEL_SERVER:
            # Resolve the model per request from the proposed ModelURI header.
            model_uri = flask.request.headers.get("ModelURI", type=str)
            try:
                current_model = load_model(model_uri)
            except Exception as e:
                json_err = {
                    "error_code": "RESOURCE_DOES_NOT_EXIST",
                    "message": str(e),
                }
                raise RestException(json_err) from e
        else:
            # Fall back to the model baked into the image.
            current_model = model
        input_schema = current_model.metadata.get_input_schema()
I could use guidance on how we want to do error handling/raising and any other model checks that should be performed (or place trust in the user?)
The before_request handler might still be useful, though, maybe not per request but in the outer init function, to preload models into the multi-model container, perhaps using an env var with comma-separated URIs.
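A rough sketch of that preloading idea, assuming a hypothetical comma-separated PRELOAD_MODEL_URIS env var (both the variable name and the cache shape are made up for illustration):

import os

from mlflow import pyfunc

# Hypothetical env var listing model URIs to warm up at startup, e.g.
# PRELOAD_MODEL_URIS="s3://bucket/model-a,s3://bucket/model-b"
_preload_uris = [u for u in os.environ.get("PRELOAD_MODEL_URIS", "").split(",") if u]

# Simple in-process cache keyed by model URI; /invocations would look here
# first and fall back to loading on a cache miss.
MODEL_CACHE = {uri: pyfunc.load_model(uri) for uri in _preload_uris}


def get_model(model_uri):
    if model_uri not in MODEL_CACHE:
        MODEL_CACHE[model_uri] = pyfunc.load_model(model_uri)
    return MODEL_CACHE[model_uri]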
Looking forward to the multi-model endpoints feature in MLflow's SageMaker integration. Any update on the enhancement?
I would also be interested in this. I am wondering if we can do a workaround with the regular deploy function in mlflow.sagemaker? There is a mode=mlflow.sagemaker.DEPLOYMENT_MODE_ADD or something like that; not sure if that will be "shadow" or multi-model, though.
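For reference, a sketch of what that call would look like (the app name, model URI, and region are placeholders). As noted earlier in the thread, add mode attaches another model to an existing endpoint but still allocates separate instances for it, so it does not give you a multi-model endpoint:

import mlflow.sagemaker as mfs

# Hypothetical app name / model URI. "add" mode attaches a new production
# variant to an existing endpoint rather than creating a new endpoint, but
# each variant still runs on its own instance(s) -- not a multi-model endpoint.
mfs.deploy(
    app_name="my-existing-endpoint",
    model_uri="models:/my-model/2",
    mode=mfs.DEPLOYMENT_MODE_ADD,
    region_name="eu-west-1",
)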
👍 I am also interested in this
Any updates on this?
Any updates on this?
Seems like not. It would be worthwhile to start developing this feature, given that multi-model endpoints are being used more and more often.
This is really a blocker for us in continuing to adopt MLflow for deployment. We have a lot of use cases with models that are seldom used. If we had to pay for one endpoint per model it would be far too costly, so we want to pool them through multi-model endpoints. Not being able to do that is a blocker for adoption.
@dbczumar it seems you were happy to review a proposal about this in 2020, but I don't see any feedback on @MarkAWard's proposal from 2021. Is there still strong interest on your side? Is it worth investing time to look into this?
@cjolif I eventually managed to do a workaround.
For each model included in the multi-model endpoint, you need to download the model binary from the MLflow registry, but only those files (nothing else can be there), repackage it into model.tar.gz, and then you can directly use the Airflow SageMakerEndpointOperator. Just provide the config for the operator:
- model config: give it a name and the assembled archive(s), assign model tags (mentioning which MLflow models are included), and also specify "Mode": "MultiModel"
- endpoint config name
- endpoint name
Then you need to update the MLflow tags for all the involved models to point to this SageMaker Model and Endpoint.
So instead of just using the mlflow.sagemaker functions, you need to retrieve the artifacts from MLflow, assemble the archive yourself, use the SageMaker SDK (or Airflow operators) to create the entities in SageMaker, do the deployment, and manage metadata like tags yourself as well, roughly as sketched below.
But that is really it; it works.
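A condensed sketch of that workaround. The model names, bucket, role ARN, image URI, and instance type are placeholders, and it assumes mlflow.artifacts.download_artifacts and the Airflow Amazon provider's SageMakerEndpointOperator are available:

import tarfile

import mlflow.artifacts
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerEndpointOperator

# 1. Download only the model binary from the MLflow registry and repackage it
#    as model.tar.gz (one archive per model, all uploaded under one S3 prefix).
local_dir = mlflow.artifacts.download_artifacts("models:/my-model/1")
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(local_dir, arcname=".")
# ...upload model.tar.gz to s3://my-bucket/mme/ (e.g. with boto3)...

# 2. Create the SageMaker Model / EndpointConfig / Endpoint via the operator;
#    the config dicts mirror the corresponding boto3 create_* calls.
deploy = SageMakerEndpointOperator(
    task_id="deploy_multi_model_endpoint",
    config={
        "Model": {
            "ModelName": "my-multi-model",
            "ExecutionRoleArn": "arn:aws:iam::123456789012:role/my-sagemaker-role",
            "PrimaryContainer": {
                "Image": "<inference image URI>",
                "Mode": "MultiModel",
                # Prefix containing all the per-model .tar.gz archives.
                "ModelDataUrl": "s3://my-bucket/mme/",
            },
        },
        "EndpointConfig": {
            "EndpointConfigName": "my-mme-config",
            "ProductionVariants": [
                {
                    "VariantName": "AllTraffic",
                    "ModelName": "my-multi-model",
                    "InstanceType": "ml.m5.large",
                    "InitialInstanceCount": 1,
                }
            ],
        },
        "Endpoint": {
            "EndpointName": "my-mme-endpoint",
            "EndpointConfigName": "my-mme-config",
        },
    },
)
# 3. Afterwards, update the MLflow model tags to point at this SageMaker
#    Model and Endpoint so the link back to MLflow is preserved.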