azure-docs
mapping between downloaded images and pandas data frame rows
This page has the code sample below, which downloads the images in a dataset and also converts the dataset to a pandas dataframe. I would like to confirm whether the indexes in animal_pd (pandas dataframe) and download_path (list) match each other. For example, will download_path[0] always correspond to the first entry in animal_pd?
Is there a guarantee that the order of rows returned by to_pandas_dataframe() is the same as the order of elements in the list returned by .download()? Without this ordering, we cannot reliably map between the dataframe and the downloaded images.
import azureml.core
from azureml.core import Dataset, Workspace
# get animal_labels dataset from the workspace
animal_labels = Dataset.get_by_name(workspace, 'animal_labels')
animal_pd = animal_labels.to_pandas_dataframe()
# download the images to local
download_path = animal_labels.download(stream_column='image_url')
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# read the first image from the downloaded paths
img = mpimg.imread(download_path[0])
imgplot = plt.imshow(img)
Document Details
⚠ Do not edit this section. It is required for learn.microsoft.com ➟ GitHub issue linking.
- ID: c02d3f15-dbc8-a1d7-6f95-2f3527d002f0
- Version Independent ID: 6c17f89c-c984-c4f5-d9f8-b943521938fc
- Content: Create and explore datasets with labels - Azure Machine Learning
- Content Source: articles/machine-learning/v1/how-to-use-labeled-dataset.md
- Service: machine-learning
- Sub-service: mldata
- GitHub Login: @Blackmist
- Microsoft Alias: larryfr
@kiranpradeep Thanks for your feedback! We will investigate and update as appropriate.
@Blackmist
Could you please review this, add your comments, and update as appropriate?
@kiranpradeep
Thanks for your feedback! I've assigned this issue to the author who will investigate and update as appropriate.
Thanks for the question @kiranpradeep
@kvijaykannan and @sdgilley can you provide any information on whether there are any guarantees on the order of rows returned by the .to_pandas_dataframe() and .download() methods of datasets?
#reassign:sdgilley
@kiranpradeep The order of the images should be the same, since both calls follow an identical execution path.
However, there is a much more ergonomic way of achieving this goal that does not need to download the files in the first place. Your 'image_url' column contains StreamInfo objects, which act as file pointers. StreamInfo implements an open() method that returns a file-like object, which can be used by any Python library that expects file objects.
So in effect you can change your code to look like this:
import azureml.core
from azureml.core import Dataset, Workspace
# get animal_labels dataset from the workspace
# (assumes `workspace` is an existing Workspace object, e.g. from Workspace.from_config())
animal_labels = Dataset.get_by_name(workspace, 'animal_labels')
animal_pd = animal_labels.to_pandas_dataframe()
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# read the first image directly from the remote stream (nothing is downloaded)
img = mpimg.imread(animal_pd['image_url'].iloc[0].open())
imgplot = plt.imshow(img)
This will let you stream from remote storage directly, with the side benefit that the whole dataset does not need to fit on disk (streaming processing).
As you can see, in this case there is no risk of mismatched records, because both image_url and the labels come from the same dataframe record.
Thanks, @anliakho2! I'm updating our document to use your modification of the code.
@sdgilley As for the documentation, I think a scenario where images are processed in a loop would be more appropriate. Also take note of the updated comment where the images are accessed: they are no longer downloaded, so the comment needs to change too.
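For reference, the loop scenario suggested above could look roughly like the sketch below. It is not taken from the documentation: iter_labeled_images is a hypothetical helper, and it only assumes what this thread describes, namely that each entry in the stream column has an open() method returning a file-like object. The label column name used in the commented usage ('label') is also an assumption and may differ in your dataset.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple


def iter_labeled_images(
    streams: Iterable[Any],
    labels: Iterable[Any],
    reader: Callable[[Any], Any],
) -> Iterator[Tuple[Any, Any]]:
    """Yield (decoded_image, label) pairs, one pair per row.

    `streams` and `labels` are meant to be two columns of the same
    dataframe, so each pair comes from a single row and the image
    cannot be matched with the wrong label.
    Assumption: every item in `streams` exposes open(), as described
    for StreamInfo in this thread.
    """
    for stream, label in zip(streams, labels):
        handle = stream.open()
        try:
            # e.g. reader = matplotlib.image.imread, as used earlier in the thread
            image = reader(handle)
        finally:
            # close the handle if the returned object supports it
            close = getattr(handle, "close", None)
            if close is not None:
                close()
        yield image, label


# Hypothetical usage with the dataframe from this thread
# ('label' is an assumed column name):
#
# for img, label in iter_labeled_images(
#         animal_pd['image_url'], animal_pd['label'], mpimg.imread):
#     plt.imshow(img)
#     plt.title(str(label))
#     plt.show()
```

Because each image and its label are read from the same row, this loop does not depend on any ordering guarantee between to_pandas_dataframe() and .download(), and only one image is opened at a time.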