
Strange data type for a dataset variable using 'create_tabular_dataset_from_delimited_files'

Open lucazav opened this issue 5 years ago • 4 comments

I'm importing the Kaggle cars dataset from an Azure Blob Storage.

library(azuremlsdk)
library(dplyr)   # for %>% and as_tibble()

dstore <- get_datastore(ws, datastore_name = 'ml_data_cool__data')
path <- data_path(dstore, 'car-features-and-msrp/car-features-data.csv')
car_ds <- create_tabular_dataset_from_delimited_files(path = path)
car_prices_tbl <- load_dataset_into_data_frame(car_ds) %>%
  as_tibble()

Looking at the inferred data types, I can see a strange list data type for the variable "Engine Fuel Type":

[screenshot: inferred column types, with "Engine Fuel Type" shown as a list]

I also tried to use the following code:

car_ds <- create_tabular_dataset_from_delimited_files(path = path,
                                                      set_column_types = reticulate::dict("Engine Fuel Type" = data_type_string()))

But I'm getting the same result.

Is it a bug? If not, how can I avoid a list for that variable?
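(For anyone hitting this before a proper fix lands: a post-hoc workaround is to flatten any list columns after loading. This is only a sketch, assuming dplyr >= 1.0 is available; the toy tibble below stands in for the real loaded dataset, and the column values are illustrative.)

```r
library(dplyr)

# Toy tibble standing in for the loaded dataset: "Engine Fuel Type"
# comes back as a list column instead of character.
car_prices_tbl <- tibble::tibble(
  Make = c("BMW", "Audi"),
  `Engine Fuel Type` = list("premium unleaded", "regular unleaded")
)

# Replace every list column with a plain character column.
car_prices_fixed <- car_prices_tbl %>%
  mutate(across(where(is.list),
                ~ vapply(.x, function(v) paste(unlist(v), collapse = ""),
                         character(1))))

str(car_prices_fixed$`Engine Fuel Type`)
```

This does not fix the type inference itself, but it makes the downstream columns usable.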

lucazav avatar Jul 29 '20 15:07 lucazav

I encounter the same issue when I pull in data from an Azure SQL database or a CSV file from Data Lake. It pulls some of the factor-type columns in as lists. For now I have to unlist them every time, during both training and scoring, and do the conversions manually within the R script.
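(Since the same conversion is needed in both the training and the scoring script, one option is to put it in a small helper that both scripts source. A base-R sketch; the function name and the example data frame are my own, not from azuremlsdk.)

```r
# flatten_list_cols: replace every list column in a data frame with a
# plain character column, so downstream code sees ordinary vectors.
flatten_list_cols <- function(df) {
  for (col in names(df)) {
    if (is.list(df[[col]])) {
      df[[col]] <- vapply(df[[col]],
                          function(x) paste(unlist(x), collapse = ""),
                          character(1))
    }
  }
  df
}

# Illustrative usage with a stand-in data frame:
df <- data.frame(id = 1:2)
df$fuel <- list("diesel", "petrol")
df <- flatten_list_cols(df)
str(df$fuel)
```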

I tried converting them in the dataset within the studio, but then we cannot load an existing dataset in RStudio within Azure, as that is an ongoing bug.

harshbangad avatar Jul 29 '20 21:07 harshbangad

@lucazav - Did you find a workaround for this? It is causing issues for me during scoring as well.

harshbangad avatar Aug 04 '20 21:08 harshbangad

I am having the same issue right now when loading a parquet dataset into a data frame: one of the columns is pulled in as a list when it is expected to be in Date format.
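(For the Date case specifically, the list column can be flattened to character and then parsed. A base-R sketch; `date_col` and the sample values are hypothetical and assume ISO-formatted date strings inside the list.)

```r
# Stand-in for a data frame where a date column came back as a list
# of strings instead of a Date vector.
df <- data.frame(id = 1:2)
df$date_col <- list("2020-10-01", "2020-10-23")

# Flatten each list element to a single string, then parse as Date.
df$date_col <- as.Date(vapply(df$date_col,
                              function(x) paste(unlist(x), collapse = ""),
                              character(1)))

class(df$date_col)
```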

jakeatmsft avatar Oct 23 '20 14:10 jakeatmsft

I have also noticed that the bug does not occur consistently across environments: on my compute instance I am able to load the data frame with the correct data types, but when submitting to a compute cluster the data types are loaded incorrectly.

jakeatmsft avatar Oct 23 '20 15:10 jakeatmsft