azureml-sdk-for-r
load_dataset_into_data_frame() generates an error
I have replicated the "consume" piece of code from Python in R. However, when using load_dataset_into_data_frame() I run into an error:
Error in py_get_attr_impl(x, name, silent): AttributeError: 'TabularDataset' object has no attribute 'to_pandas_data_frame'
Traceback:
- load_dataset_into_data_frame(dataset)
- dataset$to_pandas_data_frame
- `$.python.builtin.object`(dataset, "to_pandas_data_frame")
- py_get_attr_or_item(x, name, TRUE)
- py_get_attr(x, name)
- py_get_attr_impl(x, name, silent)
I use the same steps as in Python. The Python piece (it works):
from azureml.core import Workspace, Dataset
subscription_id = 'subscrpt_id'
resource_group = 'resource_id'
workspace_name = 'wspace_name'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='dataset_name')
dataset.to_pandas_dataframe()
R piece:
library(azuremlsdk)
subscription_id = 'subscrpt_id'
resource_group = 'resource_id'
workspace_name = 'wspace_name'
workspace = get_workspace(subscription_id = subscription_id, resource_group = resource_group, name = workspace_name)
dataset = get_dataset_by_name(workspace, name='dataset_name')
load_dataset_into_data_frame(dataset)
I would expect the last command to return a data.frame, as dataset is of class azureml.data.tabular_dataset.TabularDataset.
I'm having similar issues with many of the dataset functions as well. For example, say ds is my dataset, then
download_from_file_dataset(ds)
Error in py_get_attr_impl(x, name, silent) :
AttributeError: 'TabularDataset' object has no attribute 'download'
gives me the above error.
I have the same issue with a dataset in R; see the code:
library(azuremlsdk)
ws <- load_workspace_from_config()
ds_acsanalytics <- get_dataset_by_name(ws, "AllSkills", version = "latest")
df_acsanalytics <- load_dataset_into_data_frame(ds_acsanalytics)
I then get an error:
> df_acsanalytics <- load_dataset_into_data_frame(ds_acsanalytics)
Error in py_get_attr_impl(x, name, silent) :
AttributeError: 'TabularDataset' object has no attribute 'to_pandas_data_frame'
Does anyone know how to resolve this issue?
Hi @atulupov and @jon-nagra, I'm not really sure what's going on, but I suspect that the R wrappers lag behind changes in the Python library. For your cases, it appears that the following command should work:
df_acsanalytics <- dataset$to_pandas_dataframe()
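The AttributeError in the tracebacks above comes down to a method-name mismatch: the R wrapper requests `to_pandas_data_frame`, while the Python object exposes `to_pandas_dataframe`. Below is a minimal, self-contained Python sketch of that failure mode. It uses a hypothetical stand-in class (`FakeTabularDataset` is not part of azureml) and a hypothetical helper, `call_first_available`, which tries candidate method names in order:

```python
class FakeTabularDataset:
    """Hypothetical stand-in; it only mimics the method name seen in the thread."""

    def to_pandas_dataframe(self):
        # The real method returns a pandas DataFrame; a list of dicts keeps
        # this sketch free of third-party dependencies.
        return [{"col": 1}, {"col": 2}]


def call_first_available(obj, candidate_names):
    """Call the first method in candidate_names that obj actually provides."""
    for name in candidate_names:
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    raise AttributeError(
        f"{type(obj).__name__!r} object has none of: {', '.join(candidate_names)}"
    )


dataset = FakeTabularDataset()

# The spelling requested by the R wrapper does not exist on the object.
has_old_name = hasattr(dataset, "to_pandas_data_frame")

# Trying both spellings falls through to the one that exists.
rows = call_first_available(
    dataset, ["to_pandas_data_frame", "to_pandas_dataframe"]
)
```

Calling the method directly from R, as suggested above, sidesteps the wrapper's outdated name in the same way.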
Hi @krenova, I tried this earlier and got another error:
DatasetExecutionError: Could not import pandas
After getting this error, I tried to reinstall pandas with pyarrow, but it didn't help.
I guess the R SDK should reflect all the latest changes in the Python library. Do you have a roadmap for updates? Right now I see a lot of non-working functions in the latest version.
@atulupov, I have no idea either. I was about to give up on azuremlsdk until I found the above solution.
By the way, it appears that you have some issues with the pandas library on the Python side. You might want to try updating the Python packages in your r-reticulate environment. (I assume that you have already set up the r-reticulate environment in conda.)
If your environment has been set up, open either cmd or the conda console and enter the following:
source activate r-reticulate
pip install azureml-dataprep[pandas]
That should update the pandas library in python and hopefully it addresses the error you faced.
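Before reinstalling anything, it can help to confirm which Python interpreter is in use and whether pandas can be located from it. The following is a small sketch using only the standard library. `module_available` is a hypothetical helper name, and the sketch assumes it is run with the same interpreter that reticulate is bound to:

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter's import system can locate `name`."""
    return importlib.util.find_spec(name) is not None


# Path of the interpreter actually running this code.
print("interpreter:", sys.executable)

# Report whether each package can be located (not whether it imports cleanly).
for module in ("pandas", "pyarrow"):
    status = "found" if module_available(module) else "MISSING"
    print(f"{module}: {status}")
```

If pandas shows as missing here while it is installed elsewhere on the machine, the package was most likely installed into a different environment than the one R is using. Note that a package being located does not guarantee it imports without errors.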
I have been able to make it work by installing pandas through reticulate and running the piece of code @krenova suggested:
reticulate::py_install("pandas")
dataset$to_pandas_dataframe()
load_dataset_into_data_frame() still returns the same error.
When trying:
source activate r-reticulate
pip install azureml-dataprep[pandas]
I was getting the following error:
Could not find conda environment: r-reticulate
Thank you both for your suggestions, at least now I have a workaround.
@jon-nagra, great stuff. I think yours is the better solution, as it installs the required package directly into the Python environment that R calls.
For the benefit of others: 'r-reticulate' is the name of the conda environment my Python lives in, as set up by the azuremlsdk tutorial. If your Python environment has a different name, change the environment name accordingly. Ultimately, though, my advice would be to use @jon-nagra's approach for installing pandas and other packages.
Thanks a lot for your advice.
@jon-nagra The initial bug
Error in py_get_attr_impl(x, name, silent): AttributeError: 'TabularDataset' object has no attribute 'to_pandas_data_frame'
has been fixed since the last CRAN release, so if you install the SDK from GitHub you should no longer get this issue.
As for the pyarrow/pandas conflict issue, the best workaround is the one above: install azureml-dataprep[pandas] in the r-reticulate environment. We are currently looking into improving this.
@mx-iao I updated the packages with the dev version and load_dataset_into_data_frame() works.
However, the estimator function breaks in the r_environment piece when including CRAN packages (I don't know if it is a desired design change; shall I file a bug on this?).
I run the data load piece of the code in an experiment run on a compute cluster.
To avoid issues, is it better to stick to dataset$to_pandas_dataframe(), or can I include the azuremlsdk GitHub package in the estimator?
@jon-nagra, you can just stick with dataset$to_pandas_dataframe() for now; it's the same thing. @jon-nagra what is the issue you are seeing with estimator?
@mx-iao, when submitting an experiment using submit_experiment with an estimator, the r_environment function breaks in this loop:
if (!is.null(cran_packages)) {
  env$r$cran_packages <- list()
  for (package in cran_packages) {
    cran_package <- azureml$core$environment$RCranPackage()
    cran_package$name <- package$name
    cran_package$version <- package$version
    cran_package$repository <- package$repository
    env$r$cran_packages <- c(env$r$cran_packages, cran_package)
  }
}
package is a character string and, hence, it doesn't have a name element. For instance, my cran_packages in the estimator call look like cran_packages = c("seasonal", "here").
In the CRAN version, r_environment seems to use a different approach to load the packages.
I'm not sure whether this is a design choice, with details like the package version now expected inside cran_packages, or whether it is a bug.
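The mismatch described above (plain package names versus records carrying name, version and repository fields) can be illustrated with a small sketch, written in Python here for convenience. This is not the SDK's code: `PackageSpec` and `normalize_packages` are hypothetical names, showing just one way an input layer could accept both forms:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackageSpec:
    """Hypothetical record mirroring the name/version/repository fields."""
    name: str
    version: Optional[str] = None
    repository: Optional[str] = None


def normalize_packages(packages):
    """Accept bare names or dict records and return a list of PackageSpec."""
    specs = []
    for package in packages:
        if isinstance(package, str):
            # Bare name, e.g. "seasonal": no version or repository given.
            specs.append(PackageSpec(name=package))
        elif isinstance(package, dict):
            specs.append(
                PackageSpec(
                    name=package["name"],
                    version=package.get("version"),
                    repository=package.get("repository"),
                )
            )
        else:
            raise TypeError(f"unsupported package entry: {package!r}")
    return specs


# Both call styles produce the same kind of record.
from_names = normalize_packages(["seasonal", "here"])
from_records = normalize_packages([{"name": "seasonal", "version": "1.8.0"}])
```

The version string in the last line is purely illustrative.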
I am trying to load data from an Azure SQL dataset within an R script through an estimator, and I am getting the error below using SDK 1.10 (I am allowing all traffic to my Azure SQL for test purposes):
Error in py_call_impl(callable, dots$args, dots$keywords) : DatasetExecutionError: in operation 'to_pandas_dataframe' for Dataset(id='f5aec110-67ee-443b-aa64-282cfaf8592c', name='None', version=None)
Error Code: ScriptExecution.DatabaseConnection.Unexpected
Failed Step: 1582c0da-87bc-4c29-9fb0-febe48df8a2a
Error Message: ScriptExecutionException was caused by DatabaseConnectionException. DatabaseConnectionException was caused by UnexpectedException. 'MSSQL' encountered unexpected exception of type 'InvalidOperationException' with HResult 'x80131509' while opening connection. Internal connection fatal error.
| session_id=9e9c62cc-cff0-4ce1-9d32-e99c8042xxxx