azureml-sdk-for-r mount & download modes for FileDataset inputs for remote runs not working as expected

Open mx-iao opened this issue 4 years ago • 4 comments

When a FileDataset is passed as an input to a run with either the mount or download mode, e.g.:

est <- estimator(source_directory = ".",
                 entry_script = "train.R",
                 compute_target = compute_target,
                 environment = my_env,
                 inputs = list(dataset_consumption_config("mydata", dataset, mode = "mount")))

the expected behavior is that the data will get mounted (or downloaded), and that the mount (or download) path can be accessed with the following in the training script:

base_path <- get_input_dataset_from_run("mydata")

# OR
base_path <- Sys.getenv("AZUREML_DATAREFERENCE_mydata")

However, this is not the case. Judging by the driver log, the DatasetContextManager never gets initialized, so the mounting never happens.

Apr 28 '20 01:04 mx-iao

I'm trying to use an input dataset in my R training on a remote cluster. When passing a named dataset as input, I can see the input in the AML Studio portal when looking at the run, but I am not seeing any input datasets when printing run$input_datasets from within the R script. Is this a related issue?

Jul 17 '20 17:07 jakeatmsft

@jakeatmsft if you specify an input file dataset in mount or download mode, the expected behavior if you call get_input_dataset_from_run("my-dataset") or the equivalent

run <- get_current_run()
run$input_datasets["my-dataset"]

in the training script is to get the mount or download path. However, this is not currently the case because of the bug described above. (If you call the above code, the only thing you will get back is "DatasetConsumptionConfig:my-dataset", which is not useful.)

If you specify an input file dataset in "direct" mode, then calling get_input_dataset_from_run("my-dataset") will return the actual FileDataset object, which is the expected behavior.

If you want to mount or download your files, the workaround is to specify the path on your datastore directly (i.e. use a DataReference). See this tutorial for an example: https://azure.github.io/azureml-sdk-for-r/articles/train-and-deploy-first-model.html
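For reference, here is a rough sketch of what that workaround can look like, loosely following the linked tutorial. The cluster name, datastore folder, experiment name and the "--data_folder" argument name are placeholders I made up for illustration, and it assumes a workspace config.json is available locally, so adapt it to your own setup.

```r
library(azuremlsdk)

# Assumes a config.json describing the workspace is available locally.
ws <- load_workspace_from_config()

# Hypothetical names: replace with your own cluster and datastore folder.
compute_target <- get_compute(ws, cluster_name = "my-cluster")
ds <- get_default_datastore(ws)
target_path <- "mydata"

# ds$path() gives a reference to the folder on the datastore (a DataReference).
# Passing it as a script parameter hands the resolved path to train.R.
est <- estimator(source_directory = ".",
                 entry_script = "train.R",
                 script_params = list("--data_folder" = ds$path(target_path)),
                 compute_target = compute_target)

# "my-experiment" is a placeholder experiment name.
run <- submit_experiment(experiment(ws, "my-experiment"), est)
```

Inside train.R the path then arrives as the value of the --data_folder command-line argument, which can be parsed with, for example, the optparse package.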

Jul 17 '20 21:07 mx-iao

@mx-iao, thank you for the reply. Using this approach, can I download files using a pattern such as {path}\data* ?

Jul 20 '20 15:07 jakeatmsft

I can confirm that I can pass data using DataReference from my registered DataStores. Thanks!

ref: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.data_reference.datareference?view=azure-ml-py

Jul 21 '20 03:07 jakeatmsft
