[BUG] Cannot deploy mlflow pyfunc model to sagemaker endpoint
Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the issues policy
Where did you encounter this bug?
Other
Willingness to contribute
No. I cannot contribute a bug fix at this time.
MLflow version
- Tracking server: 2.9.2
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Python version:
- yarn version, if running the dev UI:
Describe the problem
When trying to deploy an MLflow pyfunc model as a SageMaker endpoint, the endpoint never manages to start when the model was logged with a code_path directory attached to it. When deploying the same model without the code_path directory, the endpoint starts successfully. Any idea why the extra code directory would cause an issue?
Tracking information
REPLACE_ME
Code to reproduce issue
# Try to update the deployment (the client is assumed to come from
# mlflow.deployments.get_deploy_client("sagemaker"))
from mlflow.deployments import get_deploy_client

sagemaker_client = get_deploy_client("sagemaker")
sagemaker_client.update_deployment(
    name="herobanner-contextualbandit",
    model_uri="models:/hero-banner/latest",
    config={
        "execution_role_arn": "arn:aws:iam::283774148357:role/ml-cave-sagemaker-role-prod",
        "image_url": "283774148357.dkr.ecr.eu-west-1.amazonaws.com/mlflow-pyfunc:2.9.2",
        "instance_type": "ml.m5.xlarge",
        # "env": '{"DISABLE_NGINX": "true", "GUNICORN_CMD_ARGS": "--timeout 120 -w 4 -k gevent"}',
    },
)
Stack trace
REPLACE_ME
Other info / logs
What component(s) does this bug affect?
- [ ] area/artifacts: Artifact stores and artifact logging
- [ ] area/build: Build and test infrastructure for MLflow
- [X] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] area/docs: MLflow documentation pages
- [ ] area/examples: Example code
- [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] area/models: MLmodel format, model serialization/deserialization, flavors
- [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] area/projects: MLproject format, project running backends
- [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- [ ] area/server-infra: MLflow Tracking server backend
- [ ] area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] area/windows: Windows support
What language(s) does this bug affect?
- [ ] language/r: R APIs and clients
- [ ] language/java: Java APIs and clients
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/azure: Azure and Azure ML integrations
- [X] integrations/sagemaker: SageMaker integrations
- [ ] integrations/databricks: Databricks integrations
@alex2308 Could you provide a stack trace?
A stack trace of what? Sadly, the only logs I have access to are the ones I sent above, which are generated by the SageMaker endpoint. Is there a logging setting I can enable on the mlflow-pyfunc Docker image?
Are there any logs about why the worker timed out? BTW, what is the code to reproduce this? Details about how the model is logged with code_paths would be helpful.
Sadly no logs. I have been trying to get more detailed logs from the AWS side. Maybe there are some gunicorn settings I can use to enable more logging.
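For example, something along these lines might surface more gunicorn output through the same "env" mechanism shown (commented out) in the repro code above. This is a sketch only; whether the serving container honors GUNICORN_CMD_ARGS passed this way, and the exact value format, are assumptions.

from mlflow.deployments import get_deploy_client

# Sketch: same update_deployment call as in the repro code, but with an "env"
# entry that asks gunicorn for a longer timeout and debug-level logging
# (assumption: the container reads GUNICORN_CMD_ARGS).
sagemaker_client = get_deploy_client("sagemaker")
sagemaker_client.update_deployment(
    name="herobanner-contextualbandit",
    model_uri="models:/hero-banner/latest",
    config={
        "execution_role_arn": "arn:aws:iam::283774148357:role/ml-cave-sagemaker-role-prod",
        "image_url": "283774148357.dkr.ecr.eu-west-1.amazonaws.com/mlflow-pyfunc:2.9.2",
        "instance_type": "ml.m5.xlarge",
        "env": '{"GUNICORN_CMD_ARGS": "--timeout 120 --log-level debug"}',
    },
)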
Sure, this is the logging call used to register the model.
mlflow.pyfunc.log_model(
    "models",
    python_model=model,
    registered_model_name="hero-banner",
    pip_requirements=pip_requirements,
    code_path=["hero_bandit"],
)
The hero_bandit directory is a Python package where the model is trained and deployed. I have attached a snapshot of the folder structure.
The custom PythonModel class used to define the pyfunc model is the following.
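(The actual class is not shown here; purely as an illustration, a minimal wrapper of this kind might look roughly like the sketch below, where the class name, the hero_bandit import, and the scoring call are assumptions.)

import mlflow.pyfunc

class HeroBanditWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical sketch of a wrapper around the contextual-bandit package."""

    def load_context(self, context):
        # Code shipped via code_path is importable at serving time; if this
        # import fails inside the container, the worker can die before logging much.
        from hero_bandit import bandit  # assumption: module name inside the package
        self._bandit = bandit

    def predict(self, context, model_input):
        # model_input is the feature array sent to the endpoint
        return self._bandit.predict(model_input)  # assumption: scoring entry point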
Could you make sure the same folder has been copied over to the code directory under the models folder? Another thing worth trying is to load the model locally and see if it works: mlflow.pyfunc.load_model(...)
This is the file structure of the model in MLflow; we can see that all the files have been copied over correctly.
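For reference, a pyfunc model logged with a code_path typically ends up with an artifact layout along these lines (the contents of hero_bandit shown here are placeholders, not the actual files):

models/
├── MLmodel
├── conda.yaml
├── python_env.yaml
├── requirements.txt
├── python_model.pkl
└── code/
    └── hero_bandit/
        ├── __init__.py
        └── ...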
Yes, I have tried to run it locally with the load_model function and that works fine. I have even used the code to run the pyfunc Docker image locally on my machine and to serve the endpoint locally, and that also works fine. Here is the code for loading the model locally and for deploying the Docker image locally to serve as an endpoint:
import mlflow
import mlflow.sagemaker
import numpy as np
from mlflow.models import build_docker

# If you want the latest version of the model
def read_model_from_mlflow(run_id: str):
    if run_id is None:
        model_uri_latest = "models:/hero-banner/latest"
    else:
        model_uri_latest = f"runs:/{run_id}/models"
    print(mlflow.pyfunc.get_model_dependencies(model_uri_latest))
    loaded_model = mlflow.pyfunc.load_model(model_uri_latest)
    mypayload = np.array([0.6046511627906976, True, False, False, False, True, False, False,
                          True, False, False, False])
    return loaded_model.predict(mypayload)

print(read_model_from_mlflow(None))

# Build the serving image and serve the model locally
build_docker(name="mlflow-pyfunc")
# client = get_deploy_client("sagemaker")
mlflow.sagemaker.run_local(
    name="my-local-deployment",
    model_uri="models:/hero-banner/latest",
    flavor="python_function",
    config={
        "port": 8080,
        "image": "mlflow-pyfunc",
    },
)
Could it be that, because the directory where the files are copied is called "code/", this has an impact on the way SageMaker expects the model and breaks the server somehow?
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
Here is some extra logging coming from AWS support; not sure if it can help @serena-ruan?
I think in such a case it's SageMaker's problem; from your stack trace it looks like an out-of-memory error. Could you open a ticket with AWS support instead?
I have already opened a case on their side, and as soon as they see MLflow in the process they throw the "We apologize but we do not support third party tools, please open a ticket on the MLflow side" response, hence why I am here ...
I think you should ask them what might be wrong based on the SageMaker logs, or ask them for the full stack trace. BTW, you've validated that MLflow works fine in every other environment, so they shouldn't just ignore your ticket :(
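If you have CloudWatch access, pulling the endpoint's log streams directly might also help. A sketch only, assuming the standard /aws/sagemaker/Endpoints/<endpoint-name> log group and that the endpoint name matches the deployment name:

import boto3

# Fetch the most recent CloudWatch log streams for the endpoint and print their events
logs = boto3.client("logs", region_name="eu-west-1")
log_group = "/aws/sagemaker/Endpoints/herobanner-contextualbandit"  # assumption: endpoint name

streams = logs.describe_log_streams(
    logGroupName=log_group, orderBy="LastEventTime", descending=True, limit=5
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName=log_group, logStreamName=stream["logStreamName"], limit=50
    )
    for event in events["events"]:
        print(event["message"])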