
[Azure ML SDK v2] Issue while reading data from `uri_folder` Input type via https://<account_name> scheme

Open glebrh opened this issue 2 years ago • 9 comments

  • Package Name: azure-ai-ml
  • Package Version: 1.0.0
  • Operating System: Windows Server 2022 Standard
  • Python Version: 3.9.13

Describe the bug
According to the documentation, it should be possible to access public blob storage containers via an Input(type='uri_folder') instance. For the actual data path, the Azure docs say that either the https://<account_name>.blob.core.windows.net/<container_name>/<path> or the abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> format can be used.

I tried the first option (https://) with the diabetes dataset, which is available at the following link: https://azureopendatastorage.blob.core.windows.net/mlsamples/diabetes. However, this access method fails with an error like the one below:

{"NonCompliant":"DataAccessError(NotFound)"}
{
  "code": "data-capability.UriMountSession.PyFuseError",
  "target": "",
  "category": "UserError",
  "error_details": [
    {
      "key": "NonCompliantReason",
      "value": "DataAccessError(NotFound)"
    },
    {
      "key": "StackTrace",
      "value": "  File \"/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/data_capability/capability_session.py\", line 70, in start\n    (data_path, sub_data_path) = session.start()\n\n  File \"/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/data_capability/data_sessions.py\", line 364, in start\n    options=mnt_options\n\n  File \"/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/azureml/dataprep/fuse/dprepfuse.py\", line 696, in rslex_uri_volume_mount\n    raise e\n\n  File \"/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/azureml/dataprep/fuse/dprepfuse.py\", line 690, in rslex_uri_volume_mount\n    mount_context = RslexDirectURIMountContext(mount_point, uri, options)\n"
    }
  ]
}


AzureMLCompute job failed.
data-capability.UriMountSession.PyFuseError: [REDACTED]
  Reason: [REDACTED]
  StackTrace:   File "/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/data_capability/capability_session.py", line 70, in start
    (data

With the second option, i.e. wasbs://mlsamples@azureopendatastorage.blob.core.windows.net/diabetes, the job finishes successfully.

To Reproduce
Steps to reproduce the behavior: execute the following code:

from azure.ai.ml import MLClient, Input, command

ml_client = MLClient(...)

job = command(
    command="ls ${{inputs.diabetes}}",
    inputs={
        "diabetes": Input(
            type="uri_folder",
            # https:// path to a public blob container -- this triggers the error
            path="https://azureopendatastorage.blob.core.windows.net/mlsamples/diabetes",
        )
    },
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="cpu-cluster",
    display_name="data_access_test",
    experiment_name="data_access_test",
)

ml_client.create_or_update(job)

Expected behavior
The job completes successfully, and the user logs show the list of files inside the passed blob storage folder.

glebrh avatar Nov 05 '22 14:11 glebrh

Label prediction was below confidence level 0.6 for Model:ServiceLabels: 'Storage:0.21718025,Azure.Core:0.16009027,Data Lake Storage Gen2:0.13490766'

azure-sdk avatar Nov 05 '22 14:11 azure-sdk

BTW, the uri_file type works normally with paths like this: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv

glebrh avatar Nov 05 '22 14:11 glebrh
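(Editor's note: based on the behavior reported above — https:// paths mount for uri_file but not for uri_folder — a small pre-submission guard can catch the unsupported combination before a job is queued. This is a hypothetical helper, not part of the azure-ai-ml SDK, and the scheme lists are a sketch reflecting only what this thread reports:)

```python
from urllib.parse import urlparse

# Schemes observed in this thread to mount for each input type.
# https works only for uri_file, per the reports above.
FOLDER_SCHEMES = {"wasbs", "abfss", "azureml"}
FILE_SCHEMES = FOLDER_SCHEMES | {"https"}

def is_supported_path(path: str, input_type: str) -> bool:
    """Return True if `path` uses a scheme known to mount for `input_type`."""
    scheme = urlparse(path).scheme
    if input_type == "uri_file":
        return scheme in FILE_SCHEMES
    if input_type == "uri_folder":
        return scheme in FOLDER_SCHEMES
    raise ValueError(f"unknown input type: {input_type}")
```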

@azureml-github

xiangyan99 avatar Nov 07 '22 17:11 xiangyan99

+1 — I am facing this issue as well (can't use https:// as the path for a URI_FOLDER). As a workaround, I had to switch the initial input in my pipeline code from URI_FOLDER to URI_FILE.

AndrewRTsao avatar Nov 27 '22 05:11 AndrewRTsao

Thx for reporting this. We'll investigate and get back to you.

luigiw avatar Nov 29 '22 18:11 luigiw

Hi, for uri_folder, please use a wasbs-schemed URI if it's blob storage (wasbs://<container>@<account_name>.blob.core.windows.net/<path_to_data>/),

or an abfss-schemed URI (abfss://<file_system>@<account_name>.dfs.core.windows.net/<path_to_data>/) if it's ADLS Gen2 storage.

QianqianNie avatar Nov 29 '22 21:11 QianqianNie

Then the documentation should be updated, I guess? Wherever it mentions that uri_folder access is possible via the https protocol, that should be removed?

For instance here or here

Or eventually, support for https + uri_folder will be added?

glebrh avatar Nov 29 '22 21:11 glebrh

I think this will be a document improvement.

luigiw avatar Dec 16 '22 00:12 luigiw

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Dec 23 '22 02:12 ghost

I am still facing this issue when submitting an Azure ML job that uses a URI folder on Azure blob container storage as input -> https://<account_name>.blob.core.windows.net/<container_name>/ (screenshot attached)

If I create an Azure ML data asset or Azure ML datastore for the same Azure blob container storage path, the job starts without any problems.

ylnhari avatar Mar 29 '23 03:03 ylnhari

Hi all, I am facing an issue when trying to read a CSV file stored in my GitHub repository into Azure ML. It throws the following error: (screenshot attached)

madhuyadu avatar Apr 20 '23 12:04 madhuyadu

I found this problem interesting. It seems that you have to register the datastore in the subscription and resource group where the data is located. There should be a streamlined solution for this. (screenshot attached)

tahhnik avatar Jun 24 '23 06:06 tahhnik
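(Editor's note: once a datastore is registered, as the last two comments describe, the folder can be addressed with an azureml:// datastore URI instead of a raw https URL. A small builder for that URI form — the datastore name below is a hypothetical example:)

```python
def datastore_uri(datastore: str, path: str) -> str:
    """Build an Azure ML datastore URI of the form
    azureml://datastores/<datastore>/paths/<path>."""
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"

# Example: point at the diabetes folder through a registered datastore.
uri = datastore_uri("workspaceblobstore", "mlsamples/diabetes")
```

The resulting string can then be passed as the `path` of an Input(type="uri_folder").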