[BUG] log_model in Azure ML failing but model still present in the model registry
Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the issues policy
Where did you encounter this bug?
Azure Machine Learning
Willingness to contribute
No. I cannot contribute a bug fix at this time.
MLflow version
- Client: 2.10.0
System information
- Python 3.11
Describe the problem
I have an Azure ML job which trains an sklearn model. Following azureml-examples, I want the model to be saved in the model registry (from inside the job, as in the section "Reading and writing model in a job") and returned as an output. I run:
mlflow.sklearn.log_model(
    sk_model=model,
    artifact_path=model_output,
    registered_model_name=model_name,
)
and the result is that the model is saved in the model registry but the job fails with an error:
UserErrorException:
Message: Model asset creation API failed with {'additional_properties': {'details': [{'code': 'ModelAssetPathNotFoundInStorage', 'message': 'No blobs found in storage at model asset path: azureml/4319dfec-3b63-472d-a27c-656c56197170/model_output/'}], 'message': 'The request is invalid.', 'statusCode': 400, 'code': 'BadRequest'}, 'error': <data_capability._restclient.model.models._models_py3.RootError object at 0x14d8501c5490>, 'correlation': {'operation': '378f68a2a921e81a0b51ce367ed9d501', 'request': 'c0318bc9200d9941', 'RequestId': 'c0318bc9200d9941'}, 'environment': 'westeurope', 'location': 'westeurope', 'time': datetime.datetime(2024, 2, 13, 10, 39, 18, 806659, tzinfo=<FixedOffset '+00:00'>), 'component_name': 'modelregistry'}
InnerException None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Model asset creation API failed with {'additional_properties': {'details': [{'code': 'ModelAssetPathNotFoundInStorage', 'message': 'No blobs found in storage at model asset path: azureml/4319dfec-3b63-472d-a27c-656c56197170/model_output/'}], 'message': 'The request is invalid.', 'statusCode': 400, 'code': 'BadRequest'}, 'error': <data_capability._restclient.model.models._models_py3.RootError object at 0x14d8501c5490>, 'correlation': {'operation': '378f68a2a921e81a0b51ce367ed9d501', 'request': 'c0318bc9200d9941', 'RequestId': 'c0318bc9200d9941'}, 'environment': 'westeurope', 'location': 'westeurope', 'time': datetime.datetime(2024, 2, 13, 10, 39, 18, 806659, tzinfo=<FixedOffset '+00:00'>), 'component_name': 'modelregistry'}"
}
}
(A second, invisible warning has the same content as the error.)
The code attached below is run as a command job in an Azure ML pipeline.
Tracking information
System information: Linux #61~20.04.1-Ubuntu SMP Tue Nov 21 17:50:57 UTC 2023
Python version: 3.11.7
MLflow version: 2.10.0
MLflow module location: /opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/__init__.py
Tracking URI: azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/466c9654-1c8f-4bf5-95ba-c464c64aa485/resourceGroups/Hobbits-AI-Lab/providers/Microsoft.MachineLearningServices/workspaces/mordorml
Registry URI: azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/466c9654-1c8f-4bf5-95ba-c464c64aa485/resourceGroups/Hobbits-AI-Lab/providers/Microsoft.MachineLearningServices/workspaces/mordorml
Active experiment ID: e6f0e63a-8430-4a04-bc15-25e98d991ca1
Active run ID: 5eb9ffc6-517b-4a25-8e9a-bdf10d50f0fe
Active run artifact URI: azureml://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/466c9654-1c8f-4bf5-95ba-c464c64aa485/resourceGroups/Hobbits-AI-Lab/providers/Microsoft.MachineLearningServices/workspaces/mordorml/experiments/e6f0e63a-8430-4a04-bc15-25e98d991ca1/runs/5eb9ffc6-517b-4a25-8e9a-bdf10d50f0fe/artifacts
MLflow environment variables:
MLFLOW_DISABLE_ENV_MANAGER_CONDA_WARNING: True
MLFLOW_EXPERIMENT_ID: e6f0e63a-8430-4a04-bc15-25e98d991ca1
MLFLOW_EXPERIMENT_NAME: train_fa_predictor_pipeline
MLFLOW_TRACKING_TOKEN: eyJhbGciOiJSUzI1NiIsImtpZCI6IjA3RTU0ODI2RjE1ODI4N0M0OUU5QjlGMDZFMkM5RDYyNUM2Q0MyOTIiLCJ0eXAiOiJKV1QifQ.eyJyb2xlIjoiQ29udHJpYnV0b3IiLCJzY29wZSI6Ii9zdWJzY3JpcHRpb25zLzQ2NmM5NjU0LTFjOGYtNGJmNS05NWJhLWM0NjRjNjRhYTQ4NS9yZXNvdXJjZUdyb3Vwcy9Ib2JiaXRzLUFJLUxhYi9wcm92aWRlcnMvTWljcm9zb2Z0Lk1hY2hpbmVMZWFybmluZ1NlcnZpY2VzL3dvcmtzcGFjZXMvbW9yZG9ybWwiLCJhY2NvdW50aWQiOiIwMDAwMDAwMC0wMDAwLTAwMDAtMDAwMC0wMDAwMDAwMDAwMDAiLCJ3b3Jrc3BhY2VJZCI6ImI4MTA0ZTNmLTU1YzMtNGE1NS04ZDk1LTEzYmRjNWZiYjVjMSIsInByb2plY3RpZCI6IjAwMDAwMDAwLTAwMDAtMDAwMC0wMDAwLTAwMDAwMDAwMDAwMCIsImRpc2NvdmVyeSI6InVyaTovL2Rpc2NvdmVyeXVyaS8iLCJ0aWQiOiIxYjE2YWIzZS1iOGY2LTRmZTMtOWYzZS0yZGI3ZmU1NDlmNmEiLCJvaWQiOiJmNzMxOGVmOS0xOWMwLTRmMTktYjRlMi04YjUxMDY3MjNjMWMiLCJwdWlkIjoiMTAwMzIwMDMwQ0RFOUMzQyIsImlzcyI6ImF6dXJlbWwiLCJhcHBpZCI6IkpVUkRaSU5TS0kgR3J6ZWdvcnoiLCJleHAiOjE3MDk2NDczNTUsImF1ZCI6ImF6dXJlbWwifQ.OOTVOeIXLLSEqJCr_gQCnfYZMjUE3ConQneaBfn1yzCsmG8ZnGjXSfJvKImLg44eA7jelqgLN9vkTDqMcvhbNskh1xAQYf6OqrhLx-W7gTleasWgeW-NYIxq3s48JD3ylsyk61l6RKq9V-h3QHsb_NWWnycykoWkNwVVLhGWgNL0dzPlcE5_47YhAyUWZYKIVtr5t2ZQC6lcb7FfN4I3cjvnMuQN7LDVUw4gY5wBCND6LWGgGz5k-Ahu9LQI5pRZF67n9KBHK184UzD6sIkfQFcWqjmgPO5BJ2QDDvOWCqDfVsPV6tEBF6gwupupyCfNQuN6wY2LtPORk6AOVVVWDA
MLFLOW_TRACKING_URI: azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/466c9654-1c8f-4bf5-95ba-c464c64aa485/resourceGroups/Hobbits-AI-Lab/providers/Microsoft.MachineLearningServices/workspaces/mordorml
MLflow dependencies:
Flask: 3.0.2
Jinja2: 3.1.3
aiohttp: 3.9.3
alembic: 1.13.1
azure-storage-file-datalake: 12.14.0
click: 8.1.7
cloudpickle: 3.0.0
databricks-cli: 0.18.0
docker: 7.0.0
entrypoints: 0.4
fastapi: 0.104.1
gitpython: 3.1.41
gunicorn: 21.2.0
importlib-metadata: 7.0.1
markdown: 3.3.7
matplotlib: 3.5.2
numpy: 1.26.4
packaging: 23.2
pandas: 2.1.4
protobuf: 3.20.2
pyarrow: 14.0.1
pydantic: 2.5.2
pytz: 2023.4
pyyaml: 6.0
querystring-parser: 1.2.4
requests: 2.31.0
scikit-learn: 1.3.0
scipy: 1.12.0
sqlalchemy: 2.0.26
sqlparse: 0.4.4
tiktoken: 0.5.2
uvicorn: 0.22.0
virtualenv: 20.25.0
Code to reproduce issue
import mlflow.sklearn
import sklearn.preprocessing
import typer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline


def train(
    model_output: str = typer.Argument(
        ...,
        help="Path where to save a model.",
    ),
    model_name: str = typer.Option(
        default="fa_predictor",
        help="Name used to save trained model.",
    ),
) -> None:
    # Load data

    scaler = sklearn.preprocessing.StandardScaler()
    model = make_pipeline(
        scaler,
        GradientBoostingClassifier(loss="log_loss", learning_rate=0.1, n_estimators=100, max_depth=3),
    )

    # Fit model

    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path=model_output,
        registered_model_name=model_name,
    )


if __name__ == "__main__":
    typer.run(train)
Stack trace
2024/02/13 11:12:45 WARNING mlflow.models.model: Logging model metadata to the tracking server has failed. The model artifacts have been logged successfully under azureml://westeurope.api.azureml.ms/mlflow/v2.0/subscriptions/466c9654-1c8f-4bf5-95ba-c464c64aa485/resourceGroups/Hobbits-AI-Lab/providers/Microsoft.MachineLearningServices/workspaces/mordorml/experiments/e6f0e63a-8430-4a04-bc15-25e98d991ca1/runs/74dcf196-82bc-4674-8e97-90fd57eab2b7/artifacts. Set logging level to DEBUG via `logging.getLogger("mlflow").setLevel(logging.DEBUG)` to see the full traceback.
2024/02/13 11:12:45 DEBUG mlflow.models.model:
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/models/model.py", line 625, in log
    mlflow.tracking.fluent._record_logged_model(mlflow_model, run_id)
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/tracking/fluent.py", line 1348, in _record_logged_model
    MlflowClient()._record_logged_model(run_id, mlflow_model)
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/tracking/client.py", line 1782, in _record_logged_model
    self._tracking_client._record_logged_model(run_id, mlflow_model)
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/tracking/_tracking_service/client.py", line 494, in _record_logged_model
    self.store.record_logged_model(run_id, mlflow_model)
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/store/tracking/rest_store.py", line 327, in record_logged_model
    self._call_endpoint(LogModel, req_body)
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/utils/rest_utils.py", line 220, in call_endpoint
    response = verify_rest_response(response, endpoint)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/site-packages/mlflow/utils/rest_utils.py", line 152, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Response: {'Error': {'Code': 'ValidationError', 'Severity': None, 'Message': 'The request is invalid.', 'MessageFormat': None, 'MessageParameters': None, 'ReferenceCode': None, 'DetailsUri': None, 'Target': None, 'Details': [], 'InnerError': None, 'DebugInfo': None, 'AdditionalInfo': None}, 'Correlation': {'operation': 'fca73f4adab629c35202e3e02505e070', 'request': '08a2656c522829df'}, 'Environment': 'westeurope', 'Location': 'westeurope', 'Time': '2024-02-13T11:12:45.869904+00:00', 'ComponentName': 'mlflow', 'statusCode': 400, 'error_code': 'INVALID_PARAMETER_VALUE'}
Registered model 'fa_predictor' already exists. Creating a new version of this model...
Other info / logs
No response
What component(s) does this bug affect?
- [X] area/artifacts: Artifact stores and artifact logging
- [ ] area/build: Build and test infrastructure for MLflow
- [ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] area/docs: MLflow documentation pages
- [ ] area/examples: Example code
- [X] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] area/models: MLmodel format, model serialization/deserialization, flavors
- [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] area/projects: MLproject format, project running backends
- [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- [ ] area/server-infra: MLflow Tracking server backend
- [ ] area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] area/windows: Windows support
What language(s) does this bug affect?
- [ ] language/r: R APIs and clients
- [ ] language/java: Java APIs and clients
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [X] integrations/azure: Azure and Azure ML integrations
- [ ] integrations/sagemaker: SageMaker integrations
- [ ] integrations/databricks: Databricks integrations
@akshaya-a @santiagxf would you mind taking a look here? Thank you! :)
Sure, we are happy to take a look. @gjurdzinski-deepsense, can you tell us what value you are passing in the model_output variable?
Thanks! When creating the job I'm passing Output(type=AssetTypes.CUSTOM_MODEL). When the job is running, the value of the model_output variable is just "model_output".
Let me see if I can repro and will get back to you.
@gjurdzinski-deepsense I don't see azureml-mlflow in the list of installed packages. Can you confirm that you installed the AzureML MLflow plugin?
@santiagxf I checked with pip list; it's installed:
Package                     Version
--------------------------- -----------
...
azureml-mlflow              1.55.0
...
For context on the whole installation: I base my Docker image on mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:21, which comes with Python 3.8 and conda. I use Python 3.11.7 and Poetry in my project, so my Dockerfile looks like this:
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:21
...
RUN conda install -y python=3.11.7
...
COPY poetry.lock pyproject.toml ./
RUN poetry install
...
Thanks for the reply. Unfortunately, I couldn't reproduce the issue on my end. Can you please share the environment definition, the code, and the way you are generating the job so we can have a look? Alternatively, I'm sharing here an example very similar to what you are doing. Can you check whether you are doing something different?
The job definition is as follows:
job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: mlflow-log-model
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04:latest
  conda_file: conda.yml
code: train.py
command: pyrunit train.py train_model --input-data ${{inputs.input_data}} --model-path ${{inputs.model_path}} --registered-model-name ${{inputs.registered_model_name}}
inputs:
  model_path: model
  registered_model_name: heart-classifier-pipeline
  input_data:
    type: uri_file
    path: https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv
resources:
  instance_count: 1
The environment libraries are as follows:
conda.yml
channels:
  - conda-forge
dependencies:
  - python=3.11.7
  - pip
  - pip:
      - mlflow
      - azureml-mlflow
      - datasets
      - jobtools
      - cloudpickle==3.0.0
      - scikit-learn==1.4.0
      - scipy==1.12.0
      - xgboost==2.0.3
name: mlflow-env
The training code is as follows:
train.py
# %%
import mlflow
import pandas as pd
from mlflow.models import infer_signature
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier


# %%
def train_model(input_data: str, model_path: str, registered_model_name: str = None):
    with mlflow.start_run():
        mlflow.xgboost.autolog(log_models=False)

        df = pd.read_csv(input_data)
        X_train, X_test, y_train, y_test = train_test_split(
            df.drop("target", axis=1), df["target"], test_size=0.3
        )

        encoder = ColumnTransformer(
            [
                (
                    "cat_encoding",
                    OrdinalEncoder(
                        categories="auto",
                        handle_unknown="use_encoded_value",
                        unknown_value=-1,
                        encoded_missing_value=-1,
                    ),
                    ["thal"],
                )
            ],
            remainder="passthrough",
            verbose_feature_names_out=False,
        )

        model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
        pipeline = Pipeline(steps=[("encoding", encoder), ("model", model)])
        pipeline.fit(X_train, y_train)

        y_pred = pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("test_recall", recall)

        signature = infer_signature(X_test, y_test)
        mlflow.sklearn.log_model(
            pipeline,
            artifact_path=model_path,
            signature=signature,
            registered_model_name=registered_model_name,
        )
You can run this example with:
az ml job create -f job.yml
The example files are in the following zip for your convenience: job.zip
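(For completeness, the same job.yml can also be submitted from Python rather than the CLI. This is only a sketch using the azure-ai-ml SDK v2; the workspace details below are placeholders and not part of the original example.)

# Sketch (not part of the original example): submitting job.yml through the
# Azure ML Python SDK v2 instead of `az ml job create`.
from azure.ai.ml import MLClient, load_job
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = load_job("job.yml")                         # parse the YAML job definition
submitted = ml_client.jobs.create_or_update(job)  # submit the command job
ml_client.jobs.stream(submitted.name)             # stream logs until completion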
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
I'll try to narrow down our code to the pieces I can share, but I can already point out a few things:
- We don't define the job with a .yml file. We do it more or less like this (I'll prepare a working example later):
# Define local variables ml_client, compute_cluster, ...


def prepare_train_predictor_component(
    code_path: Path,
    ml_client: MLClient,
    compute_cluster: AmlCompute,
    environment: Environment,
    model_name: str = "predictor",
) -> Component:
    script_name = "train_predictor.py"
    train_component = command(
        name="train_predictor",
        display_name="Train Predictor",
        description="Trains Predictor",
        inputs={
            "dataset_path": Input(type="uri_folder", mode="ro_mount"),
            "train_csv_filename": Input(type="string", default="train.csv"),
            "test_csv_filename": Input(type="string", default="test.csv"),
            "model_name": Input(type="string", default=model_name),
        },
        outputs={
            "model_output": Output(type=AssetTypes.CUSTOM_MODEL),
        },
        code=code_path,
        command=" ".join(
            [
                f"python {script_name}",
                "${{inputs.dataset_path}}/${{inputs.train_csv_filename}}",
                "${{inputs.dataset_path}}/${{inputs.test_csv_filename}}",
                "${{outputs.model_output}}",
                "--model-name ${{inputs.model_name}}",
            ]
        ),
        environment=f"{environment.name}:{environment.version}",
        compute=compute_cluster.name,
    )
    component: Component = ml_client.create_or_update(train_component.component)
    return component


train_component = prepare_train_predictor_component(
    ml_client,
    compute_cluster,
    environment,
    model_name,
)


@dsl.pipeline(
    compute=compute_cluster.name,
    description="Training Pipeline",
)
def train_predictor_pipeline(
    dataset_path: Input,
    train_csv_filename: Input,
    test_csv_filename: Input,
    model_type: str,
    model_name: str,
) -> PipelineJob:
    """Defines the Azure ML training pipeline."""
    train_job = train_component(
        dataset_path=dataset_path,
        train_csv_filename=train_csv_filename,
        test_csv_filename=test_csv_filename,
        model_type=model_type,
        model_name=model_name,
    )
    train_job.outputs.model_output = Output(type=AssetTypes.CUSTOM_MODEL)
    return {}


pipeline = train_predictor_pipeline(
    dataset_path=Input(
        type="uri_folder",
        path=os.path.join(azureml_storage_path, dataset_path),
        mode="ro_mount",
    ),
    train_csv_filename=train_csv_filename,
    test_csv_filename=test_csv_filename,
    model_name=model_name,
)

pipeline_job = ml_client.jobs.create_or_update(pipeline, experiment_name="train_predictor_pipeline")
ml_client.jobs.stream(pipeline_job.name)
- I added with mlflow.start_run() to the training script (sketched below, after the error message), but it fails with:
UnsupportedModelRegistryStoreURIException: Model registry functionality is unavailable; got unsupported URI 'azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/<redacted>/resourceGroups/<redacted>/providers/Microsoft.MachineLearningServices/workspaces/<redacted>' for model registry data storage. Supported URI schemes are: ['', 'file', 'databricks', 'databricks-uc', 'http', 'https', 'postgresql', 'mysql', 'sqlite', 'mssql']. See https://www.mlflow.org/docs/latest/tracking.html#storage for how to run an MLflow server against one of the supported backend storage locations.
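For readers following along, the change described in the bullet above was presumably along these lines. This is only a sketch based on the reproduction script earlier in the thread, not the exact code that produced the error:

# Sketch (not the exact code from the thread): wrapping the logging call in an
# explicit MLflow run, as described above. The model construction mirrors the
# reproduction script shared earlier in this issue.
import mlflow
import mlflow.sklearn
import sklearn.preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline


def train(model_output: str, model_name: str = "fa_predictor") -> None:
    model = make_pipeline(
        sklearn.preprocessing.StandardScaler(),
        GradientBoostingClassifier(loss="log_loss"),
    )
    with mlflow.start_run():  # explicit run instead of relying on the ambient Azure ML run
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path=model_output,
            registered_model_name=model_name,
        )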
Thanks for sharing! Using the Python SDK to create pipelines and jobs (it looks like you are using Azure ML pipelines) is completely supported. Based on the latter error message, I think we know what's going on: the azureml-mlflow plugin is not correctly installed in your environment. You can see that the azureml protocol is not being recognized. I suggest reviewing the environment used in each step of the pipeline and making sure it has the right dependencies.
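One way to sanity-check this inside a pipeline step (a diagnostic sketch, not taken from the thread) is to confirm the plugin package is importable and print the URIs MLflow resolves:

# Diagnostic sketch (not from the thread): run at the start of a pipeline step to
# confirm the azureml-mlflow plugin is importable and see which URIs MLflow resolves.
import importlib.util

import mlflow

try:
    plugin_present = importlib.util.find_spec("azureml.mlflow") is not None
except ModuleNotFoundError:
    plugin_present = False

print("azureml-mlflow importable:", plugin_present)
print("tracking URI:", mlflow.get_tracking_uri())
print("registry URI:", mlflow.get_registry_uri())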
I finally fixed it, and it turns out the problem was completely elsewhere. The job was fine; the pipeline was failing: the pipeline was setting the job output to be Output(type=AssetTypes.CUSTOM_MODEL), and the asset creation was failing. That's why I could see the model in the model registry (the job succeeded in saving it there); the failure happened later.
Thanks for your time and support!
Glad you solved the problem. I still think there was an error somewhere else, because your logs were showing a very clear message. We are always looking for ways to help users find errors more easily. If you can share the complete example, we can take a look. Thanks!
The error was not on the MLflow side. Asset creation in Azure ML was failing when I was defining an Output of type CUSTOM_MODEL.
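For anyone hitting the same symptom, one plausible adjustment (an assumption on my part; the original poster does not state the exact change) is to declare the step output as an MLflow model rather than a custom model and write the model files into that output path:

# Hypothetical adjustment (not confirmed as the exact fix used in this thread):
# declare the step output as an MLflow model rather than a custom model, and
# save the model files into the mounted output path so Azure ML's asset
# creation finds the blobs it expects.
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

outputs = {
    "model_output": Output(type=AssetTypes.MLFLOW_MODEL),  # previously AssetTypes.CUSTOM_MODEL
}

# Inside the training script, model_output is the resolved output directory:
# mlflow.sklearn.save_model(sk_model=model, path=model_output)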
> I finally fixed it, and it turns out the problem was completely elsewhere. The job was fine; the pipeline was failing: the pipeline was setting the job output to be Output(type=AssetTypes.CUSTOM_MODEL), and the asset creation was failing. That's why I could see the model in the model registry (the job succeeded in saving it there); the failure happened later. Thanks for your time and support!
I have the same issue, trying to connect a few jobs in a pipeline. The output of job No. 1 is an MLFLOW-type model which should go to the input of job No. 2. Could you elaborate on what exactly the problem was in your case and how you fixed it? Was it just a change from CUSTOM_MODEL to MLFLOW_MODEL, or was there some other issue? Thanks in advance.