
[BUG]: Model not deploying on `mlflow`

[Open] Adiii1436 opened this issue 1 year ago · 13 comments

Contact Details [Optional]

No response

System Information

ZENML_LOCAL_VERSION: 0.53.1
ZENML_SERVER_VERSION: 0.53.1
ZENML_SERVER_DATABASE: sqlite
ZENML_SERVER_DEPLOYMENT_TYPE: other
ZENML_CONFIG_DIR: /home/aditya-anand/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/aditya-anand/.config/zenml/local_stores
ZENML_SERVER_URL: sqlite:////home/aditya-anand/.config/zenml/local_stores/default_zen_store/zenml.db
ZENML_ACTIVE_REPOSITORY_ROOT: /home/aditya-anand/Adaboost/CNN
PYTHON_VERSION: 3.11.6
ENVIRONMENT: native
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '23.10'}
ACTIVE_WORKSPACE: default
ACTIVE_STACK: custom_stack
ACTIVE_USER: default
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: ec2a04fd-c7ba-42a9-9292-213b25df48fd
ANALYTICS_USER_ID: 19b62867-c0c4-4a27-8e1d-a6c2afeeee8b
ANALYTICS_SERVER_ID: ec2a04fd-c7ba-42a9-9292-213b25df48fd
INTEGRATIONS: ['kaniko', 'mlflow', 'pillow', 'pytorch', 'scipy', 'sklearn']
PACKAGES: {'brotli': '1.1.0', 'gitpython': '3.1.40', 'jinja2': '3.1.2', 'mako': '1.3.0', 'markdown': '3.5.1', 'markupsafe': '2.1.3', 'pillow': '10.1.0', 'pyjwt': '2.7.0', 'pymysql': '1.0.3', 'pyyaml': '6.0.1', 'sqlalchemy': '1.4.41', 'sqlalchemy-utils': '0.38.3', 'aiofiles': '23.2.1', 'aiohttp': '3.9.1', 'aiokafka': '0.10.0', 'aiosignal': '1.3.1', 'alembic': '1.8.1', 'anyio': '4.2.0', 'asttokens': '2.4.1', 'async-timeout': '4.0.3', 'attrs': '23.1.0', 'azure-common': '1.1.28', 'azure-core': '1.29.6', 'azure-mgmt-core': '1.4.0', 'azure-mgmt-resource': '23.0.1', 'bcrypt': '4.0.1', 'blinker': '1.7.0', 'cachetools': '5.3.2', 'certifi': '2023.11.17', 'cffi': '1.16.0', 'charset-normalizer': '3.3.2', 'click': '8.1.3', 'click-params': '0.3.0', 'cloudpickle': '2.2.1', 'comm': '0.2.0', 'contourpy': '1.2.0', 'cryptography': '41.0.7', 'cycler': '0.12.1', 'databricks-cli': '0.18.0', 'decorator': '5.1.1', 'distro': '1.9.0', 'docker': '6.1.3', 'entrypoints': '0.4', 'executing': '2.0.1', 'fastapi': '0.89.1', 'fastapi-utils': '0.2.1', 'filelock': '3.13.1', 'flask': '3.0.0', 'fonttools': '4.47.0', 'frozenlist': '1.4.1', 'fsspec': '2023.12.2', 'gevent': '23.9.1', 'geventhttpclient': '2.0.2', 'gitdb': '4.0.11', 'greenlet': '3.0.3', 'grpcio': '1.60.0', 'gunicorn': '21.2.0', 'h11': '0.14.0', 'httplib2': '0.19.1', 'httptools': '0.6.1', 'idna': '3.6', 'importlib-metadata': '7.0.1', 'importlib-resources': '6.1.1', 'ipinfo': '5.0.0', 'ipython': '8.19.0', 'ipywidgets': '8.1.1', 'isodate': '0.6.1', 'itsdangerous': '2.1.2', 'jedi': '0.19.1', 'joblib': '1.3.2', 'jupyterlab-widgets': '3.0.9', 'kiwisolver': '1.4.5', 'markdown-it-py': '3.0.0', 'matplotlib': '3.8.2', 'matplotlib-inline': '0.1.6', 'mdurl': '0.1.2', 'mlflow': '2.9.2', 'mlserver': '1.3.5', 'mlserver-mlflow': '1.3.5', 'mpmath': '1.3.0', 'multidict': '6.0.4', 'networkx': '3.2.1', 'numpy': '1.26.2', 'nvidia-cublas-cu12': '12.1.3.1', 'nvidia-cuda-cupti-cu12': '12.1.105', 'nvidia-cuda-nvrtc-cu12': '12.1.105', 'nvidia-cuda-runtime-cu12': '12.1.105', 'nvidia-cudnn-cu12': '8.9.2.26', 'nvidia-cufft-cu12': '11.0.2.54', 'nvidia-curand-cu12': '10.3.2.106', 'nvidia-cusolver-cu12': '11.4.5.107', 'nvidia-cusparse-cu12': '12.1.0.106', 'nvidia-nccl-cu12': '2.18.1', 'nvidia-nvjitlink-cu12': '12.3.101', 'nvidia-nvtx-cu12': '12.1.105', 'oauthlib': '3.2.2', 'orjson': '3.8.14', 'packaging': '23.2', 'pandas': '2.1.4', 'parso': '0.8.3', 'passlib': '1.7.4', 'pexpect': '4.9.0', 'pip': '23.2', 'prometheus-client': '0.19.0', 'prompt-toolkit': '3.0.43', 'protobuf': '4.25.1', 'psutil': '5.9.7', 'ptyprocess': '0.7.0', 'pure-eval': '0.2.2', 'py-grpc-prometheus': '0.7.0', 'pyarrow': '14.0.2', 'pycparser': '2.21', 'pydantic': '1.10.13', 'pygments': '2.17.2', 'pyparsing': '2.4.7', 'python-dateutil': '2.8.2', 'python-dotenv': '1.0.0', 'python-multipart': '0.0.6', 'python-rapidjson': '1.14', 'pytz': '2023.3.post1', 'querystring-parser': '1.2.4', 'requests': '2.31.0', 'rich': '13.7.0', 'scikit-learn': '1.3.2', 'scipy': '1.11.4', 'setuptools': '68.1.2', 'six': '1.16.0', 'smmap': '5.0.1', 'sniffio': '1.3.0', 'sqlalchemy2-stubs': '0.0.2a37', 'sqlmodel': '0.0.8', 'sqlparse': '0.4.4', 'stack-data': '0.6.3', 'starlette': '0.22.0', 'starlette-exporter': '0.17.1', 'sympy': '1.12', 'tabulate': '0.9.0', 'threadpoolctl': '3.2.0', 'torch': '2.1.2', 'torchvision': '0.16.2', 'traitlets': '5.14.0', 'triton': '2.1.0', 'tritonclient': '2.41.0', 'typing-extensions': '4.9.0', 'tzdata': '2023.3', 'urllib3': '2.1.0', 'uvicorn': '0.25.0', 'uvloop': '0.19.0', 'validators': '0.18.2', 'watchfiles': '0.21.0', 'wcwidth': '0.2.12', 'websocket-client': '1.7.0', 'websockets': '12.0', 'werkzeug': '3.0.1', 'widgetsnbextension': '4.0.9', 'yarl': '1.9.4', 'zenml': '0.53.1', 'zipp': '3.17.0', 'zope.event': '5.0', 'zope.interface': '6.1'}

CURRENT STACK

Name: custom_stack
ID: 4f7b1885-3bd4-4b3f-b2cf-b128c6cf51aa
User: default / 19b62867-c0c4-4a27-8e1d-a6c2afeeee8b
Workspace: default / 664835cb-90e2-40f1-bfd7-d4207caf3613

ORCHESTRATOR: default

Name: default
ID: a89e09c1-1a3f-4a1e-8c3a-5a6f8d343616
Type: orchestrator
Flavor: local
Configuration: {}
Workspace: default / 664835cb-90e2-40f1-bfd7-d4207caf3613

ARTIFACT_STORE: default

Name: default
ID: 1f5ccc05-be16-41db-87ab-6980006abd62
Type: artifact_store
Flavor: local
Configuration: {'path': ''}
Workspace: default / 664835cb-90e2-40f1-bfd7-d4207caf3613

MODEL_DEPLOYER: mlflow_deployer

Name: mlflow_deployer
ID: e9842188-0ecd-44d8-8b53-9dc43d87ca49
Type: model_deployer
Flavor: mlflow
Configuration: {'service_path': ''}
User: default / 19b62867-c0c4-4a27-8e1d-a6c2afeeee8b
Workspace: default / 664835cb-90e2-40f1-bfd7-d4207caf3613

EXPERIMENT_TRACKER: mlflow_experiment_tracker

Name: mlflow_experiment_tracker
ID: 5818e715-4856-47c6-9815-3d7b308ce152
Type: experiment_tracker
Flavor: mlflow
Configuration: {'experiment_name': None, 'nested': False, 'tags': {}, 'tracking_uri': 'https://dagshub.com/Adiii1436/CNN_MLFLOW.mlflow', 'tracking_username': '********', 'tracking_password': '********', 'tracking_token': '********', 'tracking_insecure_tls': False, 'databricks_host': None}
User: default / 19b62867-c0c4-4a27-8e1d-a6c2afeeee8b
Workspace: default / 664835cb-90e2-40f1-bfd7-d4207caf3613

What happened?

Error:

MLflow deployment service started and reachable at:
    http://127.0.0.1:8000/invocations

Stopping existing services...
Step mlflow_model_deployer_step has finished in 13.595s.
Run continous_deployment_pipeline-2023_12_25-17_37_02_483102 has finished in 44.326s.
You can visualize your pipeline runs in the ZenML Dashboard. In order to try it locally, please run zenml up.
You can run:
     mlflow ui --backend-store-uri 'https://dagshub.com/Adiii1436/CNN_MLFLOW.mlflow
 ...to inspect your experiment runs within the MLflow UI.
You can find your runs tracked within the `mlflow_example_pipeline` experiment. There you'll also be able to compare two or more runs.


No MLflow prediction server is currently running. The deployment pipeline must run first to train a model and deploy it. Execute the same command with the `--deploy` argument to deploy a model.

I don't know why it is stopping the existing services and preventing the model from deploying. I first tried this in WSL, and ChatGPT suggested it might be due to insufficient memory; I increased the memory but got the same error. Later I installed Ubuntu natively (no dual boot) and still got the same error. Please fix this.

Reproduction steps

deployment_pipeline.py

import torch
from steps.batch_data import batch_df
from steps.helper_functions import accuracy_fn
from steps.ingest_data import ingest_df
from zenml import pipeline, step
from zenml.config import DockerSettings
from zenml.integrations.mlflow.model_deployers.mlflow_model_deployer import (
    MLFlowModelDeployer,
)
from torch.utils.data import DataLoader
from zenml.constants import DEFAULT_SERVICE_START_STOP_TIMEOUT
from zenml.integrations.constants import MLFLOW
from zenml.integrations.mlflow.services import MLFlowDeploymentService
from zenml.integrations.mlflow.steps import mlflow_model_deployer_step
from zenml.steps import BaseParameters

from steps.initialize_model import initialize_model
from steps.train_test_model import train_test_model
from .utils import get_data_for_test

docker_settings = DockerSettings(required_integrations={MLFLOW})

@step(enable_cache=False)
def dynamic_importer() -> str:
    data = get_data_for_test()
    return data

class DeploymentTriggerConfig(BaseParameters):
    min_accuracy: float = 70.0

@step
def deployment_trigger(accuracy: float, config: DeploymentTriggerConfig) -> bool:
    return accuracy >= config.min_accuracy

class MLFlowDeploymentLoaderStepParameters(BaseParameters):
    pipeline_name: str
    step_name: str
    running: bool = True

@step(enable_cache=False)
def prediction_service_loader(
    pipeline_name: str,
    pipeline_step_name: str,
    running: bool = True,
    model_name: str = "model",
) -> MLFlowDeploymentService:
    model_deployer = MLFlowModelDeployer.get_active_model_deployer()

    existing_services = model_deployer.find_model_server(
        pipeline_name=pipeline_name,
        pipeline_step_name=pipeline_step_name,
        model_name=model_name,
        running=running,
    )

    if not existing_services:
        raise RuntimeError(
            f"No MLflow prediction service deployed by the "
            f"{pipeline_step_name} step in the {pipeline_name} "
            f"pipeline for the '{model_name}' model is currently "
            f"running."
        )
    print(existing_services)
    print(type(existing_services))
    return existing_services[0]


@step
def predictor(
    service: MLFlowDeploymentService,
    data: DataLoader,
) -> float:
    
    test_acc = 0
    
    with torch.inference_mode():
        for X, y in data:
            test_pred = service(X)
            test_acc +=  accuracy_fn(y, test_pred.argmax(dim=1))
        
    test_acc /= len(data)
    return test_acc
    
    
@pipeline(enable_cache=False, settings={"docker":docker_settings})
def continous_deployment_pipeline(
    min_accuracy: float = 0,
    workers: int = 3,
    timeout: int = DEFAULT_SERVICE_START_STOP_TIMEOUT,
):
    train_data, test_data, classes = ingest_df()
    train_dataloader, test_dataloader = batch_df(train_data, test_data)
    model,model_path = initialize_model(class_names=classes, hidden_units=10)

    _, train_acc, _, _ = train_test_model(
        model_path=model_path,
        model=model, 
        train_dataloader=train_dataloader, 
        test_dataloader=test_dataloader,
        hidden_units=10,
        classes=classes
    )

    deployment_decision = deployment_trigger(train_acc)

    mlflow_model_deployer_step(
        model=model_path,
        deploy_decision=deployment_decision,
        workers=workers,
        timeout=timeout
    )


@pipeline(enable_cache=False, settings={"docker": docker_settings})
def inference_pipeline(pipeline_name: str, pipeline_step_name: str):
    batch_data = dynamic_importer()
    model_deployment_service = prediction_service_loader(
        pipeline_name=pipeline_name,
        pipeline_step_name=pipeline_step_name,
        running=False,
    )
    predictor(service=model_deployment_service, data=batch_data)

run_deployment.py

from pipelines.deployment_pipeline import continous_deployment_pipeline, inference_pipeline
import click 
from rich import print 
from zenml.integrations.mlflow.mlflow_utils import get_tracking_uri
from zenml.integrations.mlflow.model_deployers.mlflow_model_deployer import (
    MLFlowModelDeployer,
)
from zenml.integrations.mlflow.services import MLFlowDeploymentService
from typing import cast

DEPLOY = "deploy"
PREDICT = "predict"
DEPLOY_AND_PREDICT = "deploy_and_predict"

@click.command()
@click.option(
    "--config",
    "-c",
    type=click.Choice([DEPLOY, PREDICT, DEPLOY_AND_PREDICT]),
    default=DEPLOY_AND_PREDICT,
    help="Optionally you can choose to only run the deployment "
    "pipeline to train and deploy a model (`deploy`), or to "
    "only run a prediction against the deployed model "
    "('predict'). By default both will be run "
    "('deploy_and_predict').",
)
@click.option(
    "--min-accuracy",
    default=70,
    help="Minimum accuracy for the model to be deployed.",
)
def run_deployment(config: str, min_accuracy: float):
    mlflow_model_deployer_component = MLFlowModelDeployer.get_active_model_deployer()

    deploy = config == DEPLOY or config == DEPLOY_AND_PREDICT
    predict = config == PREDICT or config == DEPLOY_AND_PREDICT

    if deploy:
        continous_deployment_pipeline(
            min_accuracy=min_accuracy,
            workers=1,
            timeout=6000,
        )

    if predict:
        inference_pipeline(
            pipeline_name="continuous_deployment_pipeline",
            pipeline_step_name="mlflow_model_deployer_step",
        )   

    print(
        "You can run:\n "
        f"[italic green]    mlflow ui --backend-store-uri '{get_tracking_uri()}"
        "[/italic green]\n ...to inspect your experiment runs within the MLflow"
        " UI.\nYou can find your runs tracked within the "
        "`mlflow_example_pipeline` experiment. There you'll also be able to "
        "compare two or more runs.\n\n"
    )

    # fetch existing services with same pipeline name, step name and model name
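    # NOTE (editor): the deployment pipeline is defined above as
    # "continous_deployment_pipeline" (no second "u"), while this lookup and the
    # inference_pipeline call use "continuous_deployment_pipeline"; if the
    # registered pipeline name really differs, find_model_server returns nothing.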
    existing_services = mlflow_model_deployer_component.find_model_server(
        pipeline_name="continuous_deployment_pipeline",
        pipeline_step_name="mlflow_model_deployer_step",
        model_name="model",
    )

    if existing_services:
        service = cast(MLFlowDeploymentService, existing_services[0])
        if service.is_running:
            print(
                f"The MLflow prediction server is running locally as a daemon "
                f"process service and accepts inference requests at:\n"
                f"    {service.prediction_url}\n"
                f"To stop the service, run "
                f"[italic green]`zenml model-deployer models delete "
                f"{str(service.uuid)}`[/italic green]."
            )
        elif service.is_failed:
            print(
                f"The MLflow prediction server is in a failed state:\n"
                f" Last state: '{service.status.state.value}'\n"
                f" Last error: '{service.status.last_error}'"
            )
    else:
        print(
            "No MLflow prediction server is currently running. The deployment "
            "pipeline must run first to train a model and deploy it. Execute "
            "the same command with the `--deploy` argument to deploy a model."
        ) 

if __name__ == "__main__":
    run_deployment()
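
For reference, given the click options above, the script is invoked along these lines (the flag values are just examples):

python run_deployment.py --config deploy --min-accuracy 70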

Relevant log output

Initiating a new run for the pipeline: continous_deployment_pipeline.
The BaseParameters class to define step parameters is deprecated. Check out our docs https://docs.zenml.io/user-guide/advanced-guide/pipelining-features/configure-steps-pipelines for information on how to parameterize your steps. As a quick fix to get rid of this warning, make sure your parameter class inherits from pydantic.BaseModel instead of the BaseParameters class.
Registered new version: (version 4).
Executing a new run.
Caching is disabled by default for continous_deployment_pipeline.
Using user: default
Using stack: custom_stack
  artifact_store: default
  experiment_tracker: mlflow_experiment_tracker
  orchestrator: default
  model_deployer: mlflow_deployer
Step ingest_df has started.
Using torch version: 2.1.2+cu121
Downloading data from pytorch server
Downloaded data from pytorch server
Step ingest_df has finished in 0.666s.
Step batch_df has started.
Transformed data into dataloaders
Step batch_df has finished in 0.569s.
Step initialize_model has started.
Initializing model
Model created and saved to: saved_model\FashionMNIST_Model.pth
Step initialize_model has finished in 0.090s.
Step train_test_model has started.
Starting training and testing loop
Epoch: 0
-------
Loading model for training
Model loaded for training
Initializing loss function and optimizer
Starting training
Looked at 0/60000 samples
Looked at 12800/60000 samples
Looked at 25600/60000 samples
Looked at 38400/60000 samples
Looked at 51200/60000 samples
Finished training

Train loss: 0.5798 | Train acc: 79.0167
Loading model for testing
Model loaded for testing
Initializing loss function

Test loss: 2.3023 | Test acc: 10.0040

/home/aditya-anand/Adaboost/CNN/venv/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
Registered model 'CNN_MODEL' already exists. Creating a new version of this model...
2023/12/25 23:07:32 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: CNN_MODEL, version 10
Created version '10' of model 'CNN_MODEL'.
Finished training and testing
/home/aditya-anand/Adaboost/CNN/venv/lib/python3.11/site-packages/zenml/integrations/mlflow/experiment_trackers/mlflow_experiment_tracker.py:245: FutureWarning: ``mlflow.gluon.autolog`` is deprecated since 2.5.0. This method will be removed in a future release.
  module.autolog(disable=True)
Failed to disable MLflow autologging for the following frameworks: ['tensorflow'].
Step train_test_model has finished in 28.972s.
Step deployment_trigger has started.
Step deployment_trigger has finished in 0.051s.
Caching disabled explicitly for mlflow_model_deployer_step.
Step mlflow_model_deployer_step has started.
Updating an existing MLflow deployment service: MLFlowDeploymentService[680269df-ff43-44fd-b6d2-9643e4691755] (type: model-serving, flavor: mlflow)
MLflow deployment service started and reachable at:
    http://127.0.0.1:8000/invocations

Stopping existing services...
Step mlflow_model_deployer_step has finished in 13.595s.
Run continous_deployment_pipeline-2023_12_25-17_37_02_483102 has finished in 44.326s.
You can visualize your pipeline runs in the ZenML Dashboard. In order to try it locally, please run zenml up.
You can run:
     mlflow ui --backend-store-uri 'https://dagshub.com/Adiii1436/CNN_MLFLOW.mlflow
 ...to inspect your experiment runs within the MLflow UI.
You can find your runs tracked within the `mlflow_example_pipeline` experiment. There you'll also be able to compare two or more runs.


No MLflow prediction server is currently running. The deployment pipeline must run first to train a model and deploy it. Execute the same command with the `--deploy` argument to deploy a model.

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

Adiii1436 avatar Dec 25 '23 17:12 Adiii1436

Maybe the issue is related to Windows (if you are using Windows). Otherwise, you can try the steps below:

  • Before re-running, stop any existing MLflow deployment using:

zenml model-deployer models delete <model_uuid>

  • Configure the deployment to use a different port instead of 8000
  • Delete any existing conda environments in the .zenml folder

I'd also recommend checking the logs of the model server container for errors.
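
As a minimal Python sketch of the first bullet (finding and stopping any existing MLflow services), assuming the find_model_server arguments used in the reproduction code and a stop() method on the returned service objects:

from zenml.integrations.mlflow.model_deployers.mlflow_model_deployer import (
    MLFlowModelDeployer,
)

deployer = MLFlowModelDeployer.get_active_model_deployer()
# look up services started by the deployment pipeline (names as registered in the logs)
services = deployer.find_model_server(
    pipeline_name="continous_deployment_pipeline",
    pipeline_step_name="mlflow_model_deployer_step",
    model_name="model",
)
for service in services:
    print(f"Stopping service {service.uuid} ...")
    service.stop(timeout=60)  # assumed service API; the CLI delete command above also works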

Vishal-Padia avatar Dec 28 '23 15:12 Vishal-Padia

I am using Ubuntu. I have tried all of your steps, but it is still not working.

Adiii1436 avatar Dec 29 '23 05:12 Adiii1436

[screenshot]

Adiii1436 avatar Dec 29 '23 06:12 Adiii1436

@Adiii1436 What exactly is now not working?

htahir1 avatar Jan 02 '24 14:01 htahir1

Same issue here with several versions (0.46.0, 0.50.0, 0.53.1) using the mlflow_model_deployer_step. After running the pipeline a second time and redeploying the model, the newly deployed model stops running. I generated the logs using the zenml model-deployer models logs <UUID> command, and they show that after the model starts running, it receives a signal to stop:

[screenshot of the service logs]
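
The same state and last error that run_deployment.py prints can also be pulled programmatically; a small sketch, assuming all find_model_server arguments are optional:

from zenml.integrations.mlflow.model_deployers.mlflow_model_deployer import (
    MLFlowModelDeployer,
)

deployer = MLFlowModelDeployer.get_active_model_deployer()
for service in deployer.find_model_server(
    pipeline_step_name="mlflow_model_deployer_step",
):
    # status.state and status.last_error are the fields run_deployment.py prints
    print(service.uuid, service.status.state.value, service.status.last_error)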

viniciusfacco avatar Jan 03 '24 17:01 viniciusfacco

@Adiii1436 Maybe some other service is running on port 8000.

Can you change the deployment to some other port?

According to your code, you have set a timeout of 6000s and are training a CNN model, which can take more time; try increasing the timeout and let me know what happens!
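
To rule out a port clash, a quick standard-library check (not ZenML-specific) can tell whether anything is already listening on port 8000:

import socket

# connect_ex returns 0 when something accepts the connection, i.e. the port is taken
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 8000)) == 0

print("Port 8000 in use:", in_use)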

Vishal-Padia avatar Jan 04 '24 11:01 Vishal-Padia

@Vishal-Padia Can you share how to change the deployment port?

Adiii1436 avatar Jan 04 '24 15:01 Adiii1436

You can configure the port parameter in the MLFlowModelDeployer component.

For Example:

from zenml.integrations.mlflow.model_deployers import MLFlowModelDeployer

model_deployer = MLFlowModelDeployer()
model_deployer.config = {
    "port": 8501, # Use port 8501 instead of default 8000
    # Other config
}

So when mlflow_model_deployer_step is executed, it will use the custom model deployer with the configured port.

You can also set the port directly on the MLFlowDeploymentService:

deployment_service = MLFlowDeploymentService(port=8501)

Vishal-Padia avatar Jan 06 '24 13:01 Vishal-Padia

from zenml.integrations.mlflow.model_deployers import MLFlowModelDeployer

model_deployer = MLFlowModelDeployer()
model_deployer.config = {
    "port": 8501, # Use port 8501 instead of default 8000
    # Other config
}

Where should I pass this model_deployer then?

Adiii1436 avatar Jan 11 '24 18:01 Adiii1436

No other service is running on port 8000.

  1. Before deploying:

[Screenshot from 2024-01-11 23-52-19]

  2. After running run_deployment.py:

[Screenshot from 2024-01-11 23-52-48]

The deployment then stops automatically.

Adiii1436 avatar Jan 11 '24 18:01 Adiii1436

@Adiii1436 Did you try increasing the timeout before trying to deploy?

Vishal-Padia avatar Jan 12 '24 05:01 Vishal-Padia

Yes, I increased it, but it is still not working. As soon as the model is deployed, the existing services are stopped. And nothing else is on port 8000.

Adiii1436 avatar Jan 12 '24 06:01 Adiii1436

Hey @Adiii1436, sorry about the delay. I analyzed the information you provided and unfortunately cannot see any leads right away. I would be glad to analyze it further, but we have made many changes here and there since 0.53.1, so it might be hard to retrospect and fix on current releases. So, can you upgrade to a recent version of ZenML and try again? Once this is done, I can pick it up from there.

Also a side note/question:

  1. Are you trying to deploy a torch model? Can you give it a shot with something similar (e.g. sklearn) to confirm that the issue is not in torch specifics?
  2. In your predictor step, I suspect you need to call service.predict(X), not service(X):
@step
def predictor(
    service: MLFlowDeploymentService,
    data: DataLoader,
) -> float:
    
    test_acc = 0
    
    with torch.inference_mode():
        for X, y in data:
            test_pred = service(X)
            test_acc +=  accuracy_fn(y, test_pred.argmax(dim=1))
        
    test_acc /= len(data)
    return test_acc
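
For reference, a sketch of the predictor step with that change applied, reusing the imports from deployment_pipeline.py above and assuming service.predict() accepts and returns numpy arrays (the tensor conversions are illustrative):

import numpy as np
import torch

@step
def predictor(
    service: MLFlowDeploymentService,
    data: DataLoader,
) -> float:
    test_acc = 0.0
    with torch.inference_mode():
        for X, y in data:
            # call the service's predict() method instead of calling the service itself
            test_pred = service.predict(X.numpy())
            test_pred = torch.from_numpy(np.asarray(test_pred))
            test_acc += accuracy_fn(y, test_pred.argmax(dim=1))
    test_acc /= len(data)
    return test_acc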

avishniakov avatar Feb 21 '24 14:02 avishniakov