azureml-sdk-for-r
Strange data type for a dataset variable using 'create_tabular_dataset_from_delimited_files'
I'm importing the Kaggle cars dataset from Azure Blob Storage.
dstore <- get_datastore(ws, datastore_name = 'ml_data_cool__data')
path <- data_path(dstore, 'car-features-and-msrp/car-features-data.csv')
car_ds <- create_tabular_dataset_from_delimited_files(path = path)
car_prices_tbl <- load_dataset_into_data_frame(car_ds) %>%
  as_tibble()
Looking at the inferred data types, I can see a strange list data type for the variable "Engine Fuel Type":

I also tried to use the following code:
car_ds <- create_tabular_dataset_from_delimited_files(path = path,
  set_column_types = reticulate::dict("Engine Fuel Type" = data_type_string()))
But I'm getting the same result.
Is it a bug? If not, how can I avoid a list for that variable?
I encounter the same issue when I pull in data from an Azure SQL database or from a CSV file in Data Lake. Some of the columns that should be factors are loaded as lists. For now I have to unlist them every time, during both training and scoring, and do the conversions manually within the R script.
I tried converting the types on the dataset within the studio instead, but then we cannot load an existing dataset from RStudio within Azure, as that is an ongoing bug.
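The manual unlisting described above could be wrapped in a small helper so that training and scoring share the same conversion. This is only a sketch in base R: `flatten_list_columns` is a hypothetical name, not part of azuremlsdk, and it assumes each list element holds at most one value (empty or NULL elements become NA).

```r
# Hypothetical helper (not part of azuremlsdk): turn list columns whose
# elements each hold at most one value into ordinary atomic vectors.
flatten_list_columns <- function(df) {
  for (col in names(df)) {
    values <- df[[col]]
    # Leave ordinary (non-list) columns untouched.
    if (!is.list(values)) next
    # Skip columns where some element holds several values; unlisting
    # those would change the number of rows.
    if (any(lengths(values) > 1)) next
    # Replace NULL / empty elements with NA so the length is preserved.
    values[lengths(values) == 0] <- NA
    df[[col]] <- unlist(values, use.names = FALSE)
  }
  df
}

# Made-up demo data standing in for a loaded dataset.
demo <- data.frame(id = 1:3)
demo$fuel <- list("diesel", NULL, "electric")

fixed <- flatten_list_columns(demo)
str(fixed)
```

Note that `unlist()` drops classes such as Date, so a date column flattened this way would come back as numeric and need a further `as.Date(..., origin = "1970-01-01")`.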
@lucazav - Did you find a workaround for this? It is causing problems for me during scoring as well.
I am having the same issue right now when loading a parquet dataset into a data frame: one of the columns is loaded as a list when it is expected to be a Date.
I have also noticed that the bug does not occur consistently across environments. On my compute instance the data frame loads with the correct data types, but when I submit the same job to a compute cluster the data types are loaded incorrectly.
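To compare the two environments, it may help to log the column types right after loading and diff the output from the compute instance against the cluster run. A minimal base R sketch, assuming the loaded data frame is available; `describe_columns` is a hypothetical helper and the sample data below is made up for illustration:

```r
# Hypothetical helper (not part of azuremlsdk): report each column's first
# class and whether it is stored as a list.
describe_columns <- function(df) {
  data.frame(
    column = names(df),
    class = vapply(df, function(x) class(x)[1], character(1),
                   USE.NAMES = FALSE),
    is_list = vapply(df, is.list, logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up sample with a date column that arrived as a list.
sample_df <- data.frame(id = 1:2)
sample_df$sale_date <- list(as.Date("2020-01-01"), as.Date("2020-01-02"))

report <- describe_columns(sample_df)
print(report)
```

Printing the same report in both places, along with the installed package versions, should show whether the difference comes from the environment rather than from the data.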