azure-sdk-for-python
Enforce consistency between Azure Machine Learning datasets API and Azure Machine Learning Models API
Currently, neither the AzureML Python SDK nor the CLI supports filtering datasets by tags.
When doing:
from azureml.core import Run, Datastore, Workspace
from azureml.core import Dataset
ws = Workspace.from_config()
Dataset.get_all(ws)
You get only the latest version of each dataset.
When you do:
from azureml.core import Run, Datastore, Workspace, Model
from azureml.core import Dataset
ws = Workspace.from_config()
Model.list(ws)
You get all of the model versions, and you can actually use Model.list(ws, tags=['key', ['key2', 'key2 value']]).
The behaviour of the Model API is what I expect, but the Dataset behaviour is inconsistent with it.
In fact, in the REST API it is possible to use the tags property:
https://learn.microsoft.com/en-us/rest/api/azureml/2022-06-01-preview/data-containers/list?tabs=HTTP
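For example, listing the data containers over REST and filtering on their tags client-side looks roughly like this (untested sketch; the route follows the data-containers list URL from the doc above, the <subscription-id>-style placeholders are mine, and I am assuming the tags show up under properties in the response):

import requests
from azure.identity import DefaultAzureCredential

# ARM token for the management plane
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = ("https://management.azure.com/subscriptions/<subscription-id>"
       "/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices"
       "/workspaces/<workspace-name>/data?api-version=2022-06-01-preview")
resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
# keep only the data containers carrying a given tag
flagged = [c for c in resp.json().get("value", [])
           if c.get("properties", {}).get("tags", {}).get("deploy_flag") == "Y"]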
I suggest supporting:
from azureml.core import Run, Datastore, Workspace, Model
from azureml.core import Dataset
ws = Workspace.from_config()
Dataset.list(ws)
Dataset.list(ws, tags=['key', ['key2', 'key2 value']])
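Until something like that exists, the closest I can get for the latest versions only is filtering client-side (rough sketch, assuming the tags attribute is populated on the returned Dataset objects):

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
# Dataset.get_all returns {name: Dataset} with only the latest version of each dataset
all_latest = Dataset.get_all(ws)
tagged = {name: ds for name, ds in all_latest.items()
          if (ds.tags or {}).get('deploy_flag') == 'Y'}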
BR E
Thank you for your feedback. This has been routed to the support team for assistance.
Hi @edgBR, thanks for the feedback. We'll get back to you ASAP (@azureml-github).
Hi @l0lawrence, thanks for that. Just to add the expected functionality:
To get the latest version of a model given a tag:
try:
    old_model = Model(workspace=workspace, name=args.model_name,
                      version=None, tags=[['deploy_flag', 'Y']])  # retrieves the latest matching version
    logging.info("Old model found")
except Exception:
    logging.info("No model tagged with deploy_flag == 'Y' found")
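If I read the v1 reference docs right, Model.list also accepts a latest flag, so an equivalent lookup could be written roughly as follows (untested sketch; the latest=True behaviour is my assumption from the docs):

# filter by tag and keep only the newest version of each matching model
matches = Model.list(workspace, name=args.model_name,
                     tags=[['deploy_flag', 'Y']], latest=True)
old_model = matches[0] if matches else None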
To get the latest version of a dataset given a tag, I am currently using the following suboptimal workaround:
def get_training_history(self):
    latest_version = Dataset.get_by_name(
        self.ws, name=self.ml_datafile,
        version='latest').version
    for i in range(1, latest_version + 1):
        df = Dataset.get_by_name(
            self.ws, name=self.ml_datafile, version=str(i))
        try:
            self.training_dataset_names.append(df.name)
            self.training_dataset_versions.append(df.version)
            self.training_dataset_deploy_tags.append(df.tags['deploy_flag'])
        except (KeyError, TypeError):  # this version has no 'deploy_flag' tag
            self.training_dataset_deploy_tags.append(np.nan)
    self.training_version_df = pd.DataFrame(
        {'dataset_name': self.training_dataset_names,
         'dataset_version': self.training_dataset_versions,
         'deploy_flag': self.training_dataset_deploy_tags})
    logging.info("Training history created")
    self.training_version_df.dropna(inplace=True, axis=0, how='any')
    self.training_data_version = self.training_version_df.query(
        'deploy_flag == @self.deploy_flag').groupby('dataset_name').agg(
        last_version=('dataset_version', 'last')).reset_index()['last_version'][0]
This is far from usable.
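A slightly shorter version of the same brute-force loop, without pandas, would be something like this (sketch only; it still costs one service call per version, so it is not a real fix):

def latest_version_with_tag(ws, dataset_name, tag_key='deploy_flag', tag_value='Y'):
    # walk the registered versions from newest to oldest and stop at the first match
    latest = Dataset.get_by_name(ws, name=dataset_name, version='latest').version
    for v in range(latest, 0, -1):
        ds = Dataset.get_by_name(ws, name=dataset_name, version=str(v))
        if (ds.tags or {}).get(tag_key) == tag_value:
            return ds
    return None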
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github, @Azure/azure-ml-sdk.
@edgBR thanks for your feedback. Have you tried the AzureML v2 SDK? The features you described are already being implemented in the v2 SDK. I can provide more details if you're interested. https://learn.microsoft.com/en-us/python/api/overview/azure/ml/installv2?view=azure-ml-py
Hi @luigiw, do you have an ETA for SDK 2.0 leaving preview? Our team has been struggling with most of the preview features of AML, and we are quite skeptical of relying on non-stable releases.
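For reference, a rough v2 equivalent of the dataset side would be something like the following (sketch only; I have not verified whether tag filters can be pushed down to the service, so this filters client-side, and the asset name is a placeholder):

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# assumes a config.json describing the workspace is available locally
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# list every registered version of a data asset and keep the ones carrying the tag
versions = ml_client.data.list(name="my-training-data")
flagged = [d for d in versions if (d.tags or {}).get("deploy_flag") == "Y"]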
@edgBR glad you asked ;). The azure-ai-ml package is GA as of version 1.0.0. https://pypi.org/project/azure-ai-ml/#history
Hi @edgBR. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.
/unresolve
The fact that SDK v2 provides this functionality doesn't mean it is solved for the original environment.
It feels unrealistic to have to refactor all of our code to comply with AzureML SDK v2 just because v1 has an API inconsistency.
Hi, any update on this?
@edgBR thanks for your feedback. As AzureML is migrating to the v2 CLI/SDK, we will no longer accept feature requests for v1.
Hi @luigiw, I'm sorry but this does not look reasonable.
SDK v2 lacks a lot of functionality: it cannot publish ML pipelines as endpoints, it lacks a proper data drift example in the solution accelerators, and honestly there are still a lot of things missing!
As a customer of Azure, my company and I are really astonished by this behaviour.
Are you really not going to fix this issue in 1.0?
@edgBR I really appreciate your feedback. Pipeline jobs as endpoints are coming to v2 in the next couple of months. I'd love to know what other gaps are blocking you from upgrading to v2; if you can open a separate issue for those, that will help us a lot.
I'd argue the original issue you reported is more of a feature request than a bug, which makes it hard to get teams within AzureML to invest in it.