
Strange data type for a dataset variable using 'create_tabular_dataset_from_delimited_files'

Open lucazav opened this issue 5 years ago • 4 comments

I'm importing the Kaggle cars dataset from an Azure Blob Storage.

library(azuremlsdk)
library(dplyr)   # for %>% and as_tibble()

dstore <- get_datastore(ws, datastore_name = 'ml_data_cool__data')
path <- data_path(dstore, 'car-features-and-msrp/car-features-data.csv')
car_ds <- create_tabular_dataset_from_delimited_files(path = path)
car_prices_tbl <- load_dataset_into_data_frame(car_ds) %>%
  as_tibble()

Looking at the inferred data types, I can see a strange list data type for the variable "Engine Fuel Type":

[screenshot: inferred column types, with "Engine Fuel Type" shown as a list]

I also tried to use the following code:

car_ds <- create_tabular_dataset_from_delimited_files(path = path,
                                                      set_column_types = reticulate::dict("Engine Fuel Type" = data_type_string()))

But I'm getting the same result.

Is it a bug? If not, how can I avoid a list for that variable?
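(For anyone hitting this before a proper fix lands: a post-hoc workaround is to flatten any list columns after loading. This is only a sketch, assuming dplyr >= 1.0 is available; the toy tibble below stands in for the real loaded dataset, and the column values are illustrative.)

```r
library(dplyr)

# Toy tibble standing in for the loaded dataset: "Engine Fuel Type"
# comes back as a list column instead of character.
car_prices_tbl <- tibble::tibble(
  Make = c("BMW", "Audi"),
  `Engine Fuel Type` = list("premium unleaded", "regular unleaded")
)

# Replace every list column with a plain character column.
car_prices_fixed <- car_prices_tbl %>%
  mutate(across(where(is.list),
                ~ vapply(.x, function(v) paste(unlist(v), collapse = ""),
                         character(1))))

str(car_prices_fixed$`Engine Fuel Type`)
```

This does not fix the type inference itself, but it makes the downstream columns usable.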

lucazav avatar Jul 29 '20 15:07 lucazav

I encounter the same issue when I pull in data from an Azure SQL database or a CSV file from Data Lake. It pulls some of the factor-type columns in as lists. For now I have to unlist them every time, during both training and scoring, and do the conversions manually within the R script.
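(Since the same conversion is needed in both the training and the scoring script, one option is to put it in a small helper that both scripts source. A base-R sketch; the function name and the example data frame are my own, not from azuremlsdk.)

```r
# flatten_list_cols: replace every list column in a data frame with a
# plain character column, so downstream code sees ordinary vectors.
flatten_list_cols <- function(df) {
  for (col in names(df)) {
    if (is.list(df[[col]])) {
      df[[col]] <- vapply(df[[col]],
                          function(x) paste(unlist(x), collapse = ""),
                          character(1))
    }
  }
  df
}

# Illustrative usage with a stand-in data frame:
df <- data.frame(id = 1:2)
df$fuel <- list("diesel", "petrol")
df <- flatten_list_cols(df)
str(df$fuel)
```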

I tried converting them in the dataset within the studio, but then we cannot load an existing dataset in RStudio within Azure, as that is an ongoing bug.

harshbangad avatar Jul 29 '20 21:07 harshbangad

@lucazav - Did you find a workaround for this? It is causing issues for me during scoring as well.

harshbangad avatar Aug 04 '20 21:08 harshbangad

I am having the same issue right now when loading a parquet dataset into a data frame: one of the columns is pulled in as a list when it is expected to be in Date format.
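(For the Date case specifically, the list column can be flattened to character and then parsed. A base-R sketch; `date_col` and the sample values are hypothetical and assume ISO-formatted date strings inside the list.)

```r
# Stand-in for a data frame where a date column came back as a list
# of strings instead of a Date vector.
df <- data.frame(id = 1:2)
df$date_col <- list("2020-10-01", "2020-10-23")

# Flatten each list element to a single string, then parse as Date.
df$date_col <- as.Date(vapply(df$date_col,
                              function(x) paste(unlist(x), collapse = ""),
                              character(1)))

class(df$date_col)
```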

jakeatmsft avatar Oct 23 '20 14:10 jakeatmsft

I have also noticed that the bug does not occur consistently across environments: on my compute instance I am able to load the data frame with the correct data types, but when submitting to a compute cluster the data types are loaded incorrectly.

jakeatmsft avatar Oct 23 '20 15:10 jakeatmsft