mapping between downloaded images and pandas data frame rows

Open kiranpradeep opened this issue 3 years ago • 1 comments

This page has the code sample below, which downloads the images in a dataset and also converts the dataset to a pandas dataframe. I'd like to confirm whether the indexes in animal_pd (pandas dataframe) and download_path (list) match each other. For example, will download_path[0] always correspond to the first entry in animal_pd?

Is there a guarantee that the order of rows from to_pandas_dataframe() will be the same as the order of elements in the list returned by .download()? Without this ordering, we cannot reliably map between the dataframe and the downloaded images.

import azureml.core
from azureml.core import Dataset, Workspace

# get animal_labels dataset from the workspace
animal_labels = Dataset.get_by_name(workspace, 'animal_labels')
animal_pd = animal_labels.to_pandas_dataframe()

# download the images to local 
download_path = animal_labels.download(stream_column='image_url') 

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

#read images from downloaded path
img = mpimg.imread(download_path[0])
imgplot = plt.imshow(img)

kiranpradeep avatar Nov 11 '22 05:11 kiranpradeep

@kiranpradeep Thanks for your feedback! We will investigate and update as appropriate.

SaibabaBalapur-MSFT avatar Nov 11 '22 15:11 SaibabaBalapur-MSFT

@Blackmist

Could you please review this, add your comments, and update as appropriate?

Naveenommi-MSFT avatar Nov 14 '22 07:11 Naveenommi-MSFT

@kiranpradeep

Thanks for your feedback! I've assigned this issue to the author who will investigate and update as appropriate.

Naveenommi-MSFT avatar Nov 14 '22 07:11 Naveenommi-MSFT

Thanks for the question @kiranpradeep

@kvijaykannan and @sdgilley can you provide any information on whether there are any guarantees that the order of rows from the .to_pandas_dataframe() method matches the order of paths returned by the .download() method of datasets?

Blackmist avatar Nov 14 '22 13:11 Blackmist

#reassign:sdgilley

Blackmist avatar Nov 14 '22 13:11 Blackmist

@kiranpradeep The order of the images should be the same, as the execution follows an identical path. However, there is a much more ergonomic way of achieving this goal, which does not require downloading the files in the first place. Your 'image_url' column contains StreamInfo file pointer objects. Each one implements an open() method that returns a file-like object, which can be used by any Python library that expects file objects. In effect, you can change your code to look like this:

import azureml.core
from azureml.core import Dataset, Workspace

# get animal_labels dataset from the workspace
animal_labels = Dataset.get_by_name(workspace, 'animal_labels')
animal_pd = animal_labels.to_pandas_dataframe()

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# read the first image directly from the remote stream (nothing is downloaded)
img = mpimg.imread(animal_pd['image_url'].iloc[0].open())
imgplot = plt.imshow(img)

This lets you stream directly from remote storage, with the side benefit that the whole dataset does not need to fit on disk (streaming processing).

As you can see, in this case there is no risk of mismatched records, because both image_url and the labels come from the same dataframe record.

anliakho2 avatar Nov 18 '22 19:11 anliakho2

Thanks, @anliakho2! I'm updating our document to use your modification of the code.

sdgilley avatar Nov 18 '22 19:11 sdgilley

@sdgilley As for the documentation, I think a scenario where images are processed in a loop would be more appropriate. Also take note of the updated comment where the images are accessed: they are no longer downloaded, so the comment needs to change too.
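
For illustration only, a loop of this kind might look like the sketch below. It is not the code that went into the documentation. To stay runnable without a workspace it uses a hypothetical stand-in class, FakeStream, that models only the open() method described above. The column name 'label', the helper iter_labeled_images and its reader parameter are assumptions as well. It also assumes the file-like object returned by open() can be used in a with statement.

```python
import io
from typing import Any, Callable, Iterable, Iterator, Mapping, Tuple


class FakeStream:
    """Hypothetical stand-in for the objects in 'image_url'; only open() is modelled."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def open(self) -> io.BytesIO:
        # Return a fresh file-like object each time, as a remote stream would.
        return io.BytesIO(self._payload)


def iter_labeled_images(
    records: Iterable[Mapping[str, Any]],
    stream_column: str = "image_url",
    label_column: str = "label",  # assumed column name, not taken from the thread
    reader: Callable[[Any], Any] = lambda f: f.read(),
) -> Iterator[Tuple[Any, Any]]:
    """Yield (label, image) pairs, taking both values from the same record."""
    for record in records:
        # Assumes the opened stream supports the context-manager protocol.
        with record[stream_column].open() as f:
            yield record[label_column], reader(f)


# Stand-in rows; real rows would come from the dataframe.
records = [
    {"image_url": FakeStream(b"cat-bytes"), "label": "cat"},
    {"image_url": FakeStream(b"dog-bytes"), "label": "dog"},
]

pairs = list(iter_labeled_images(records))
for label, data in pairs:
    print(label, len(data))
```

With the real dataframe, one option would be to pass animal_pd.to_dict('records') as records and an image decoder such as mpimg.imread as reader. Whether the decoder needs an explicit format for a given stream is not covered here. Because the label and the stream are read from the same record, the pairing does not depend on any ordering between two separate collections.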

anliakho2 avatar Nov 18 '22 19:11 anliakho2