
[BUG] Cannot deploy mlflow pyfunc model to sagemaker endpoint

Open alex2308 opened this issue 1 year ago • 12 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Other

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Tracking server: 2.9.2

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Python version:
  • yarn version, if running the dev UI:

Describe the problem

When trying to deploy an MLflow pyfunc model as a SageMaker endpoint, the endpoint never manages to start when the model was logged with a code_path directory attached to it. When deploying the same model without the code_path directory, the endpoint starts successfully. Any idea why the extra code directory would cause an issue?

Tracking information

REPLACE_ME

Code to reproduce issue

# Try to update deployment
# Assumed setup (not shown in the original snippet): obtain the SageMaker
# deployment client via mlflow.deployments; the target URI may also need a
# region qualifier, e.g. "sagemaker:/<region>".
from mlflow.deployments import get_deploy_client

sagemaker_client = get_deploy_client("sagemaker")

sagemaker_client.update_deployment(
    name="herobanner-contextualbandit",
    model_uri="models:/hero-banner/latest",
    config={
        "execution_role_arn": "arn:aws:iam::283774148357:role/ml-cave-sagemaker-role-prod",
        "image_url": "283774148357.dkr.ecr.eu-west-1.amazonaws.com/mlflow-pyfunc:2.9.2",
        "instance_type": "ml.m5.xlarge",
        # "env": '{"DISABLE_NGINX": "true", "GUNICORN_CMD_ARGS": "--timeout 120 -w 4 -k gevent"}'
    },
)
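
A minimal sketch (not from the original report) of how the commented-out "env" entry above might be used to coax more logs out of the container: the keys mirror the commented line, and the extra --log-level flag is a standard gunicorn argument, so treat this as an assumption rather than a verified fix.

# Hypothetical config variant: same settings as above, plus environment variables
# for the mlflow-pyfunc container to raise gunicorn verbosity and timeout.
config = {
    "execution_role_arn": "arn:aws:iam::283774148357:role/ml-cave-sagemaker-role-prod",
    "image_url": "283774148357.dkr.ecr.eu-west-1.amazonaws.com/mlflow-pyfunc:2.9.2",
    "instance_type": "ml.m5.xlarge",
    "env": '{"DISABLE_NGINX": "true", "GUNICORN_CMD_ARGS": "--timeout 120 -w 4 -k gevent --log-level debug"}',
}
sagemaker_client.update_deployment(
    name="herobanner-contextualbandit",
    model_uri="models:/hero-banner/latest",
    config=config,
)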

[Screenshot from 2024-01-26 16-10-02: bad_structure]

Stack trace

REPLACE_ME

Other info / logs

[Screenshot from 2024-01-26 16-13-34]

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [X] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [X] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

alex2308 avatar Jan 26 '24 15:01 alex2308

@alex2308 Could you provide a stack trace?

serena-ruan avatar Jan 29 '24 03:01 serena-ruan

A stack trace of what? Sadly, the only logs I have access to are the ones I sent above, which are generated by the SageMaker endpoint. Or is there a logging setting I can enable on the mlflow-pyfunc Docker image?

alex2308 avatar Jan 29 '24 09:01 alex2308

Are there any logs about why the worker timed out? BTW, what's the code to repro? Details about logging the model with code_paths would be helpful.

serena-ruan avatar Jan 29 '24 10:01 serena-ruan

Sadly no logs; I have been trying to get more detailed logs from the AWS side. Maybe there are some settings in gunicorn I can use to enable more logging. Sure, this is the logging call used to register the model:

mlflow.pyfunc.log_model(
    "models",
    python_model=model,
    registered_model_name="hero-banner",
    pip_requirements=pip_requirements,
    code_path=["hero_bandit"],
)

The hero_bandit directory is a Python package where the model is trained and deployed. I have attached a snapshot of the folder structure. [Screenshot from 2024-01-29 12-02-10]

The custom PythonModel class defining the pyfunc model is the following. [Screenshot from 2024-01-29 12-03-30]
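
The class itself is only shown as a screenshot; as a point of reference, here is a minimal sketch of what a pyfunc PythonModel wrapper of this kind typically looks like. The class name, artifact handling, and prediction logic are hypothetical, not the actual code from the hero_bandit package.

import mlflow.pyfunc

class HeroBanditModel(mlflow.pyfunc.PythonModel):
    # Hypothetical wrapper; the real class lives in the hero_bandit package.
    def load_context(self, context):
        # Restore any serialized state (e.g. bandit weights) from context.artifacts.
        self.state = None

    def predict(self, context, model_input):
        # model_input is the payload passed to loaded_model.predict(...).
        return model_input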

alex2308 avatar Jan 29 '24 11:01 alex2308

Could you make sure the same folder has been copied over to the code directory under the models folder? Another thing worth trying is to load the model locally and see if it works: mlflow.pyfunc.load_model(...)
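
For example (a minimal sketch, not from the thread; the model URI is taken from the snippets above), the registered model's artifacts can be pulled down and inspected like this:

import os
import mlflow

# Download the registered model's artifacts locally and check that the code/
# directory contains the hero_bandit package.
local_path = mlflow.artifacts.download_artifacts(artifact_uri="models:/hero-banner/latest")
print(os.listdir(local_path))
print(os.listdir(os.path.join(local_path, "code")))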

serena-ruan avatar Jan 30 '24 03:01 serena-ruan

This is the file structure of the model in MLflow; we can see that all the files have been copied correctly. [Screenshot from 2024-01-30 11-18-18] Yes, I have tried to run it locally with the load_model function and that works fine. I have even used the code to build the pyfunc Docker image locally on my machine and to serve the endpoint locally, and that also works fine. Here is the code for loading the model locally and for deploying the Docker image locally to serve as an endpoint:

# Assumed imports (not shown in the original snippet)
import numpy as np
import mlflow
from mlflow.models import build_docker

# If you want the latest version of the model
def read_model_from_mlflow(run_id: str):
    if run_id is None:
        model_uri_latest = "models:/hero-banner/latest"
    else:
        model_uri_latest = f"runs:/{run_id}/models"
    print(mlflow.pyfunc.get_model_dependencies(model_uri_latest))
    loaded_model = mlflow.pyfunc.load_model(model_uri_latest)
    mypayload = np.array([0.6046511627906976, True, False, False, False, True, False, False,
        True, False, False, False])
    return loaded_model.predict(mypayload)

print(read_model_from_mlflow(None))
build_docker(name="mlflow-pyfunc")

#client = get_deploy_client("sagemaker")
mlflow.sagemaker.run_local(
    name="my-local-deployment",
    model_uri="models:/hero-banner/latest",
    flavor="python_function",
    config={
        "port": 8080,
        "image": "mlflow-pyfunc",
    },
)
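
Not in the original comment: a small sketch of how the locally served endpoint could be exercised once run_local is up on port 8080. The "inputs" payload shape is an assumption based on the predict call above, not a confirmed schema for this model.

import json
import requests

# Hit the local MLflow scoring server started by run_local.
payload = {"inputs": [[0.6046511627906976, True, False, False, False, True,
                       False, False, True, False, False, False]]}
resp = requests.post(
    "http://localhost:8080/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(resp.status_code, resp.text)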

alex2308 avatar Jan 30 '24 10:01 alex2308

Could it be that, because the directory the files are copied into is called "code/", it affects the way SageMaker expects the model and somehow breaks the server?

alex2308 avatar Feb 01 '24 09:02 alex2308

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

github-actions[bot] avatar Feb 03 '24 00:02 github-actions[bot]

This is some extra logging coming from AWS support; not sure if it can help, @serena-ruan? [Screenshot from 2024-02-05 11-23-29]

alex2308 avatar Feb 05 '24 10:02 alex2308

I think in that case it's SageMaker's problem; from your stack trace it looks like an out-of-memory issue. Could you open a ticket with AWS support instead?

serena-ruan avatar Feb 06 '24 06:02 serena-ruan

I have already opened a case on their side, and as soon as they see MLflow in the process they reply with "We apologize, but we do not support third-party tools; please open a ticket on the MLflow side", hence why I am here ...

alex2308 avatar Feb 06 '24 08:02 alex2308

I think you should ask them what might be wrong based on the SageMaker logs, or ask them for the full stack trace. BTW, you've validated that MLflow works fine in all other environments, so they shouldn't just ignore your ticket :(

serena-ruan avatar Feb 08 '24 01:02 serena-ruan