azure-sdk-for-python

Intermittent dropping of files while mounting FileDataset on a local machine.

Open deepankersingh96 opened this issue 3 years ago • 4 comments

  • Package Name: azureml-core
  • Package Version: 1.44.0
  • Operating System: Ubuntu 18.04.6 LTS
  • Python Version: Python 3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59)

Describe the bug
The task is fairly simple. I have uploaded the Synscapes dataset (https://synscapes.on.liu.se/download.html) to Azure Blob Storage and have registered it in my AzureML Studio datastores. The dataset contains 25000 ".png" image files that need to be mounted on the local machine; a small feature extraction using OpenCV is then run on each image, and the extracted features are stored as a .pckl file. Some of the 25000 image files, at random, are not found in the mount path, as indicated by the OpenCV warning: [ WARN:[email protected]] global /io/opencv/modules/imgcodecs/src/loadsave.cpp (239) findDecoder imread_('<path/to/file/filename>.png'): can't open/read file: check file path/integrity

Refer to this code snippet for details:

import os
import pickle

import cv2
from azureml.core import Workspace, Dataset

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='<azureml_datastore_name>')

with dataset.mount() as mount_context:
    features = []
    mount_path = mount_context.mount_point

    # image_paths and save_path are defined elsewhere; image paths are relative to the mount point.
    for image_path in image_paths:
        image = cv2.imread(os.path.join(mount_path, image_path), cv2.IMREAD_UNCHANGED)
        features.append(get_feature(image))

    with open(save_path, "wb") as handle:
        pickle.dump(features, handle, protocol=pickle.HIGHEST_PROTOCOL)

Output: After running successfully for a few thousand images, OpenCV throws the following warning: [ WARN:[email protected]] global /io/opencv/modules/imgcodecs/src/loadsave.cpp (239) findDecoder imread_('<path/to/file/filename>.png'): can't open/read file: check file path/integrity

I have verified that the image files exist and are not corrupted. Supporting this, a given image loads successfully in some runs but fails with the error above in others.
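
One way to make the intermittent nature visible is to retry the read a few times before giving up, since cv2.imread returns None instead of raising when it cannot open a file. A minimal sketch, using a hypothetical imread_with_retry helper that is not part of the original script:

import time

import cv2

def imread_with_retry(path, flags=cv2.IMREAD_UNCHANGED, retries=3, delay=1.0):
    # cv2.imread returns None when the file cannot be opened or decoded,
    # which is how the intermittent mount failure surfaces.
    for attempt in range(retries):
        image = cv2.imread(path, flags)
        if image is not None:
            return image
        time.sleep(delay)
    raise IOError("could not read {} after {} attempts".format(path, retries))

If the same path succeeds on a later attempt, that confirms the file itself is fine and the mount is dropping reads transiently.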

To Reproduce
Steps to reproduce the behavior:

  1. Upload the Synscapes dataset to Azure Blob Storage.
  2. Register it as a dataset in the AzureML datastore.
  3. Mount the data on the local machine as shown in the code snippet above.
  4. Wait for the errors to be thrown.

Expected behavior
All 25000 images (in this scenario) should be accessible on the local machine after mounting the dataset from AzureML Studio.
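
As a way to isolate the problem, a sketch (reusing the workspace placeholders from the snippet above; the target path is a hypothetical local directory) that downloads the file dataset instead of mounting it:

from azureml.core import Dataset, Workspace

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='<azureml_datastore_name>')

# Download the files to local disk instead of mounting; if every image reads
# correctly from the downloaded copy, the failures are specific to the mount.
downloaded_paths = dataset.download(target_path='/tmp/synscapes', overwrite=True)
print(len(downloaded_paths), 'files downloaded')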

deepankersingh96 avatar Nov 24 '22 09:11 deepankersingh96

Thanks for the feedback, we’ll investigate asap.

xiangyan99 avatar Nov 28 '22 17:11 xiangyan99

@azureml-github

xiangyan99 avatar Nov 28 '22 17:11 xiangyan99

@deepankersingh96 The first thing to try is upgrading to the latest SDK version (currently 1.48); it includes a few bug fixes and a major rewrite of the mount logic. Additionally, it's important to know how your file dataset was defined. Could you please share the output of Dataset.get_by_name(workspace, name='<azureml_datastore_name>')._dataflow._steps?
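
For reference, a minimal sketch of checking the installed SDK version and printing the requested dataflow steps, reusing the placeholders from the snippet above:

import azureml.core
from azureml.core import Dataset, Workspace

# Confirm the installed azureml-core version (1.48 or later was suggested).
print(azureml.core.VERSION)

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='<azureml_datastore_name>')

# Print the dataflow steps that show how the file dataset was defined.
print(dataset._dataflow._steps)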

anliakho2 avatar Dec 07 '22 21:12 anliakho2

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Dec 23 '22 02:12 ghost