azureml-sdk-for-r

load_dataset_into_data_frame() generates an error

Open jon-nagra opened this issue 4 years ago • 14 comments

I have replicated the Python "consume" code snippet in R. However, when I call load_dataset_into_data_frame(), I run into an error:

Error in py_get_attr_impl(x, name, silent): AttributeError: 'TabularDataset' object has no attribute 'to_pandas_data_frame'
Traceback:

  1. load_dataset_into_data_frame(dataset)
  2. dataset$to_pandas_data_frame
  3. $.python.builtin.object(dataset, "to_pandas_data_frame")
  4. py_get_attr_or_item(x, name, TRUE)
  5. py_get_attr(x, name)
  6. py_get_attr_impl(x, name, silent)

I use the same steps as in Python. Python version (this works):

from azureml.core import Workspace, Dataset

subscription_id = 'subscrpt_id'
resource_group = 'resource_id'
workspace_name = 'wspace_name'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='dataset_name')
dataset.to_pandas_dataframe()

R version:

library(azuremlsdk)

subscription_id = 'subscrpt_id'
resource_group = 'resource_id'
workspace_name = 'wspace_name'

workspace = get_workspace(subscription_id  = subscription_id, resource_group = resource_group, name = workspace_name)

dataset = get_dataset_by_name(workspace, name='dataset_name')
load_dataset_into_data_frame(dataset)

I would expect the last command to return a data.frame, since dataset is of class azureml.data.tabular_dataset.TabularDataset.
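The traceback shows the R wrapper asking the Python object for `to_pandas_data_frame`, while the working Python snippet calls `to_pandas_dataframe`. Below is a minimal Python sketch of how such a near-miss in a method name surfaces and how to look for the intended one. `FakeTabularDataset` and `suggest_method` are hypothetical stand-ins written for this illustration; they are not part of azureml.

```python
import difflib


class FakeTabularDataset:
    """Hypothetical stand-in for a dataset object; not the azureml class."""

    def to_pandas_dataframe(self):
        # A real dataset would return a pandas DataFrame; a list of rows
        # keeps this sketch free of third-party dependencies.
        return [{"col": 1}, {"col": 2}]


def suggest_method(obj, wanted):
    """Return public attribute names of obj that closely resemble `wanted`."""
    public = [name for name in dir(obj) if not name.startswith("_")]
    return difflib.get_close_matches(wanted, public, n=3, cutoff=0.8)


dataset = FakeTabularDataset()
try:
    # The name the R wrapper requested, which the object does not define.
    getattr(dataset, "to_pandas_data_frame")
except AttributeError as err:
    print(err)
    print("Close matches:", suggest_method(dataset, "to_pandas_data_frame"))
```

Listing close matches this way is only a diagnostic aid; it does not change what the wrapper calls.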

jon-nagra • May 22 '20 09:05

I'm having similar issues with many of the dataset functions as well. For example, if ds is my dataset, then

download_from_file_dataset(ds)
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: 'TabularDataset' object has no attribute 'download'

gives me the above error.

krenova • May 26 '20 14:05

I have the same issue with a dataset in R; see the code:

library(azuremlsdk)
ws <- load_workspace_from_config()
ds_acsanalytics <- get_dataset_by_name(ws, "AllSkills", version = "latest")

df_acsanalytics <- load_dataset_into_data_frame(ds_acsanalytics)

Then I get an error:

> df_acsanalytics <- load_dataset_into_data_frame(ds_acsanalytics)
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: 'TabularDataset' object has no attribute 'to_pandas_data_frame'

Does anyone know how to resolve this issue?

atulupov • May 28 '20 12:05

Hi @atulupov and @jon-nagra, I'm not really sure what's going on, but I suspect that the R wrappers lag behind changes in the Python library. For your cases, it appears that the following command should work:

df_acsanalytics <- dataset$to_pandas_dataframe()

krenova • Jun 01 '20 05:06

Hi @krenova, I tried this earlier and got another error: DatasetExecutionError: Could not import pandas. After getting this error, I tried reinstalling pandas with pyarrow, but it didn't help.

atulupov • Jun 01 '20 07:06

I guess the R SDK should reflect all the latest changes in the Python library. Do you have a roadmap for updates? Right now I see a lot of non-working functions in the latest version ((

atulupov • Jun 01 '20 07:06

@atulupov, I have no idea either. I was about to give up on azuremlsdk until I found the above solution.

By the way, it appears that you have some issues with the pandas library in Python. You might want to try updating the Python packages in your r-reticulate environment. (I assume that you have already set up the r-reticulate environment in conda.)

If your environment has been set up, go to either cmd or the conda console and enter the following:

source activate r-reticulate
pip install azureml-dataprep[pandas]

That should update the pandas library in Python and hopefully resolve the error you faced.
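Before reinstalling anything, it can help to confirm which modules the interpreter can actually find. This is a small sketch using only the standard library. The helper name `missing_modules` is made up for this example, and the module list is an assumption based on the packages mentioned in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pandas and pyarrow are the packages discussed above; adjust as needed.
print("Missing:", missing_modules(["pandas", "pyarrow"]))
```

Running the script through reticulate (for example with reticulate::py_run_file) checks the interpreter that R is bound to rather than whichever Python happens to be first on the PATH.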

krenova • Jun 01 '20 07:06

I was able to make it work by installing pandas through reticulate and running the code @krenova suggested:

reticulate::py_install("pandas")
dataset$to_pandas_dataframe()

load_dataset_into_data_frame() returns the same error.

When trying:

source activate r-reticulate
pip install azureml-dataprep[pandas]

I was getting the following error:

Could not find conda environment: r-reticulate

Thank you both for your suggestions, at least now I have a workaround.

jon-nagra • Jun 01 '20 09:06

@jon-nagra, great stuff. I think yours is the better solution, as it installs the required package directly into the environment of the Python that R calls.

For the benefit of others: 'r-reticulate' is the name of the conda environment my Python lives in, as set up by the azuremlsdk tutorial. If your Python lives in an environment with a different name, change the environment name accordingly. Ultimately, though, my advice would be to use @jon-nagra's approach for installing pandas and other packages.
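If it is unclear which environment the Python interpreter belongs to, a short report like the following can help. This is a sketch under assumptions: `describe_python_environment` is a hypothetical helper, and `CONDA_DEFAULT_ENV` is only set when an environment was activated through conda, so it may be empty in other setups.

```python
import os
import sys


def describe_python_environment():
    """Collect details that identify the environment this interpreter runs in."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Set by conda on activation; None if no conda env was activated.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_python_environment().items():
    print(f"{key}: {value}")
```

The prefix path usually ends in the environment's directory name, which is the name to use when activating it.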

krenova • Jun 01 '20 09:06

Thanks a lot for your advice.

atulupov • Jun 01 '20 11:06

@jon-nagra The initial bug (Error in py_get_attr_impl(x, name, silent): AttributeError: 'TabularDataset' object has no attribute 'to_pandas_data_frame') has been fixed since the last CRAN release, so if you install the SDK from GitHub you should no longer hit this issue.

As for the pyarrow/pandas conflict, the best workaround is the one above: install azureml-dataprep[pandas] in the r-reticulate environment. We are currently looking into improving this.

mx-iao • Jun 24 '20 01:06

@mx-iao I updated the packages to the dev version and load_dataset_into_data_frame() works. However, the estimator function breaks in the r_environment step when CRAN packages are included (I don't know whether this is an intended design change; shall I file a bug for it?). I run the data-loading part of the code in an experiment on a compute cluster. To avoid issues, is it better to stick with dataset$to_pandas_dataframe(), or can I include the azuremlsdk GitHub package in the estimator?

jon-nagra • Jun 24 '20 08:06

@jon-nagra, you can just stick with dataset$to_pandas_dataframe() for now; it's the same thing. @jon-nagra what is the issue you are seeing with estimator?

mx-iao • Jun 24 '20 20:06

@mx-iao, when submitting an experiment using submit_experiment with an estimator, the r_environment function breaks in this loop:

  if (!is.null(cran_packages)) {
    env$r$cran_packages <- list()
    for (package in cran_packages) {
      cran_package <- azureml$core$environment$RCranPackage()
      cran_package$name <- package$name
      cran_package$version <- package$version
      cran_package$repository <- package$repository
      env$r$cran_packages <- c(env$r$cran_packages, cran_package)
    }
  }

package is a character string and, hence, has no name element. For instance, the cran_packages in my estimator call look like cran_packages = c("seasonal", "here"). In the CRAN version, r_environment seems to use a different approach to load the packages. I'm not sure whether this is a design choice, so that things like the package version should now be included in cran_packages, or whether it is a bug.
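The loop reads fields such as package$name, which presumes each entry is a structured object rather than a bare string like "seasonal". Here is a sketch in Python of one way to accept both shapes. `normalize_packages` and the dict keys are illustrative assumptions that mirror the fields in the loop; this is not the SDK's actual behavior, and the version string is a made-up example value.

```python
def normalize_packages(packages):
    """Turn a mix of plain names and dict specs into uniform dict specs."""
    normalized = []
    for package in packages:
        if isinstance(package, str):
            # A bare name carries no version or repository information.
            package = {"name": package}
        normalized.append(
            {
                "name": package["name"],
                "version": package.get("version"),
                "repository": package.get("repository"),
            }
        )
    return normalized


# Mixed input: one bare name, one spec with an illustrative version.
specs = normalize_packages(["seasonal", {"name": "here", "version": "1.0.1"}])
for spec in specs:
    print(spec)
```

Normalizing up front keeps the later loop simple, since every entry then exposes the same three fields whether or not the caller supplied them.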

jon-nagra • Jun 25 '20 01:06

> @jon-nagra, you can just stick with dataset$to_pandas_dataframe() for now; it's the same thing. @jon-nagra what is the issue you are seeing with estimator?

I am trying to load data from an Azure SQL dataset within an R script through an estimator, and I am getting the error below using SDK 1.10 (I am allowing all traffic to my Azure SQL for test purposes):

Error in py_call_impl(callable, dots$args, dots$keywords) : DatasetExecutionError: in operation 'to_pandas_dataframe' for Dataset(id='f5aec110-67ee-443b-aa64-282cfaf8592c', name='None', version=None) Error Code: ScriptExecution.DatabaseConnection.Unexpected Failed Step: 1582c0da-87bc-4c29-9fb0-febe48df8a2a Error Message: ScriptExecutionException was caused by DatabaseConnectionException. DatabaseConnectionException was caused by UnexpectedException. 'MSSQL' encountered unexpected exception of type 'InvalidOperationException' with HResult 'x80131509' while opening connection. Internal connection fatal error. | session_id=9e9c62cc-cff0-4ce1-9d32-e99c8042xxxx

anoviceds • Aug 27 '20 05:08