
Enforce consistency between Azure Machine Learning datasets API and Azure Machine Learning Models API

Open edgBR opened this issue 2 years ago • 3 comments

Currently, neither the AzureML Python SDK nor the CLI supports filtering datasets by tags.

When you do:

from azureml.core import Run, Datastore, Workspace
from azureml.core import Dataset
ws = Workspace.from_config()
Dataset.get_all(ws)

You get only the latest version of a dataset.

When you do:

from azureml.core import Run, Datastore, Workspace, Model
from azureml.core import Dataset
ws = Workspace.from_config()
Model.list(ws)

You get all versions of every model, and you can also filter by tags with Model.list(ws, tags=['key', ['key2', 'key2 value']]).

The behaviour of the Model API is what I expect; the Dataset API is inconsistent with it.

In fact, the REST API already makes it possible to use the tags property:

https://learn.microsoft.com/en-us/rest/api/azureml/2022-06-01-preview/data-containers/list?tabs=HTTP

I suggest that the Dataset API support the following:

from azureml.core import Run, Datastore, Workspace, Model
from azureml.core import Dataset
ws = Workspace.from_config()
Dataset.list(ws)
Dataset.list(ws, tags=['key', ['key2', 'key2 value']])
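
Until something like Dataset.list exists, the same filtering can be approximated on the client. The following is only a sketch: `matches_tags` and `filter_by_tags` are hypothetical helper names, not part of azureml-core, and the filter format is assumed to mirror the one Model.list accepts (a bare key, or a [key, value] pair). It works on any objects that expose a `tags` dict; the stand-in objects below replace real datasets so the example runs without a workspace.

```python
from types import SimpleNamespace


def matches_tags(tags, filters):
    """Return True if the `tags` dict satisfies every filter.

    Each filter is either a bare key (the tag must exist) or a
    [key, value] pair (the tag must exist with that value).
    """
    for f in filters:
        if isinstance(f, str):
            if f not in tags:
                return False
        else:
            key, value = f
            if tags.get(key) != value:
                return False
    return True


def filter_by_tags(items, filters):
    """Keep the items whose `.tags` dict satisfies every filter."""
    return [item for item in items if matches_tags(item.tags or {}, filters)]


# Stand-ins for dataset objects; real ones would come from the SDK.
datasets = [
    SimpleNamespace(name="sales", version=1, tags={"key": "a"}),
    SimpleNamespace(name="sales", version=2, tags={"key": "a", "key2": "key2 value"}),
    SimpleNamespace(name="stock", version=1, tags={}),
]
selected = filter_by_tags(datasets, ["key", ["key2", "key2 value"]])
print([(d.name, d.version) for d in selected])
```

This only filters objects that have already been fetched, so it does not replace server-side filtering.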

BR E

edgBR avatar Sep 20 '22 14:09 edgBR

Thank you for your feedback. This has been routed to the support team for assistance.

ghost avatar Sep 20 '22 15:09 ghost

Hi @edgBR, thanks for the feedback. We'll get back to you ASAP (@azureml-github).

l0lawrence avatar Sep 20 '22 15:09 l0lawrence

Hi @l0lawrence, thanks for that. Just to add some detail on the expected functionality:

To get the latest version of a model given a tag:

try:
    old_model = Model(workspace=workspace, name=args.model_name,
                      version=None, tags=[['deploy_flag', 'Y']])  # retrieves the latest
    logging.info("Old model found")
except Exception:  # the original snippet omits its handler; this is a placeholder
    old_model = None

To get the latest version of a dataset given a tag, I am using the following suboptimal workaround:

def get_training_history(self):
        latest_version = Dataset.get_by_name(
            self.ws, name=self.ml_datafile,
            version='latest').version

        for i in range(1, latest_version + 1):
            df = Dataset.get_by_name(
                self.ws, name=self.ml_datafile, version=str(i))
            try:
                self.training_dataset_names.append(df.name)
                self.training_dataset_versions.append(df.version)
                self.training_dataset_deploy_tags.append(df.tags['deploy_flag'])
            except KeyError:  # this version has no 'deploy_flag' tag
                self.training_dataset_deploy_tags.append(np.nan)

        self.training_version_df = pd.DataFrame(
            {'dataset_name': self.training_dataset_names, 
            'dataset_version': self.training_dataset_versions, 
            'deploy_flag': self.training_dataset_deploy_tags})

        logging.info("Training history created")

        self.training_version_df.dropna(inplace=True, axis=0, how='any')

        self.training_data_version = self.training_version_df.query('[email protected]_flag').groupby('dataset_name').agg(
            last_version=('dataset_version', 'last')).reset_index()['last_version'][0]

This is far from usable.
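
A shorter variant of the same workaround walks the versions from newest to oldest and stops at the first match, which avoids building a DataFrame. This is a sketch under assumptions: `latest_version_with_tag` is a hypothetical helper, and `get_version` stands for any callable that returns a dataset for a version number, for example a thin wrapper around Dataset.get_by_name. An in-memory dictionary replaces the workspace here so the example runs on its own.

```python
from types import SimpleNamespace


def latest_version_with_tag(get_version, latest, key, value):
    """Return the highest version in 1..latest whose tags[key] == value, or None.

    `get_version(i)` must return an object with `.version` and `.tags`.
    """
    for i in range(latest, 0, -1):
        dataset = get_version(i)
        if (dataset.tags or {}).get(key) == value:
            return dataset.version
    return None


# In-memory stand-in for the workspace. With azureml-core the callable could be
#   lambda i: Dataset.get_by_name(ws, name=name, version=i)
fake_store = {
    1: SimpleNamespace(version=1, tags={"deploy_flag": "Y"}),
    2: SimpleNamespace(version=2, tags={"deploy_flag": "Y"}),
    3: SimpleNamespace(version=3, tags={}),
}
print(latest_version_with_tag(fake_store.__getitem__, 3, "deploy_flag", "Y"))
```

It still costs one lookup per version in the worst case, so it does not remove the need for tag filtering in the API itself.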

edgBR avatar Sep 21 '22 16:09 edgBR

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github, @Azure/azure-ml-sdk.

Author: edgBR
Assignees: bandsina
Labels:

question, Machine Learning, Service Attention, Mgmt, customer-reported

Milestone: -

ghost avatar Oct 14 '22 15:10 ghost

@edgBR thanks for your feedback. Have you tried the AzureML v2 SDK? The features you described are already being implemented in the v2 SDK. I can provide more details if you're interested. https://learn.microsoft.com/en-us/python/api/overview/azure/ml/installv2?view=azure-ml-py

luigiw avatar Oct 14 '22 16:10 luigiw

Hi @luigiw, do you have an ETA for SDK 2.0 leaving preview? Our team has been struggling with most of the preview features of AML, and we are quite skeptical of using non-stable things.

edgBR avatar Oct 14 '22 17:10 edgBR

@edgBR glad you asked ;). The azure-ai-ml package is in GA as of version 1.0.0. https://pypi.org/project/azure-ai-ml/#history

luigiw avatar Oct 14 '22 17:10 luigiw

Hi @edgBR. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

ghost avatar Oct 19 '22 19:10 ghost

/unresolve

The fact that SDK 2.0 provides this functionality doesn't mean that it is solved for the original environment.

It feels unrealistic to have to refactor all your code to comply with AzureML SDK v2.0 just because 1.x has an API inconsistency.

edgBR avatar Oct 19 '22 21:10 edgBR

Hi, any update on this?

edgBR avatar Oct 27 '22 13:10 edgBR

@edgBR thanks for your feedback. As AzureML is migrating to the v2 CLI/SDK, we are no longer accepting feature requests for v1.

luigiw avatar Oct 27 '22 22:10 luigiw

Hi @luigiw, I'm sorry, but this does not look reasonable.

SDK 2.0 lacks a lot of functionality, such as publishing ML pipelines as endpoints, it lacks a proper example of data drift in the solutions accelerator, and honestly there are still a lot of things missing!!!

As Azure customers, my company and I are really astonished by this behaviour.

Are you really not going to fix this issue in 1.0?

edgBR avatar Nov 10 '22 17:11 edgBR

@edgBR I really appreciate your feedback. Pipeline jobs as endpoints are coming in the next couple of months in v2. I'd love to know what other gaps are blocking you from upgrading to v2. If you can open a separate issue for that, it will help us a lot.

I'd argue that the original issue you reported is more of a feature request than a bug, which makes it hard to get teams within AzureML to invest in it.

luigiw avatar Nov 10 '22 21:11 luigiw