MachineLearningNotebooks
AzureML TabularDataSet via parquet and pandas index error
Azure's TabularDataset implementation introduces an extra index column, __index_level_0__, when creating or reading Parquet files that were originally written by pandas. This occurs when the index is unnamed but has been modified at some point; if the index is named, we get an extra column with the same name as the index instead.
When making changes to datasets, this additional field causes Azure errors if not handled. Depending on what's been done to the index of the original dataset, you may or may not get that additional field.
I have an example notebook that can be run to reproduce the issue. It's here: https://github.com/vla6/Azure_notes/blob/main/tabulardataset_parquet_index_di_issue.ipynb
The notebook requires an Azure Machine Learning workspace and a storage account to run.
At the very least, could we have more on valid indexing in docs?
Facing the same issue - is there an update yet?
When creating datasets, I make sure that I pass a DataFrame with an Azure-friendly index.
For example, enforce a RangeIndex
with df.reset_index(drop=True, inplace=True)
or create a more sophisticated index with df.set_index(['patient_id', 'encounter_id'], inplace=True)
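A small sketch of the two options above, using the non-inplace forms so the original frame stays untouched. The sample data is invented; only the column names patient_id and encounter_id come from the comment. This only shows the pandas side and does not call any AzureML API.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 2],
        "encounter_id": [10, 11, 12],
        "value": [0.5, 0.7, 0.9],
    }
)

# Filtering keeps the original row labels, so the index is no longer a RangeIndex.
subset = df[df["value"] > 0.6]

# Option 1: drop the leftover labels and go back to a plain RangeIndex.
range_indexed = subset.reset_index(drop=True)

# Option 2: use explicit, named key columns as the index.
keyed = subset.set_index(["patient_id", "encounter_id"])

print(type(subset.index).__name__)
print(type(range_indexed.index).__name__)
print(list(keyed.index.names))
```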
I had the same issue. While it was fixed by @maciejskorski's solution of resetting the index, it is annoying that Azure couldn't figure it out itself. The index was already a sequential integer index with no breaks; it just happened to be a pandas Int64 datatype for some reason. So while it looked and behaved for all purposes like a regular RangeIndex, AzureML got confused.
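For anyone wanting to check for this case before registering a dataset, here is a rough sketch that flags an index holding 0..n-1 without being an actual RangeIndex. The helper name is_disguised_range is hypothetical, and the lookalike frame is constructed by hand to imitate the situation described above.

```python
import pandas as pd


def is_disguised_range(index):
    """True if the index holds 0..n-1 but is not an actual RangeIndex."""
    if isinstance(index, pd.RangeIndex):
        return False
    return list(index) == list(range(len(index)))


# A frame with the default RangeIndex.
real = pd.DataFrame({"x": [1, 2, 3]})

# Same labels, but stored as an explicit int64 index instead of a RangeIndex.
lookalike = pd.DataFrame({"x": [1, 2, 3]}, index=pd.Index([0, 1, 2], dtype="int64"))

print(is_disguised_range(real.index))
print(is_disguised_range(lookalike.index))

# Resetting the index turns the lookalike back into a true RangeIndex.
fixed = lookalike.reset_index(drop=True)
print(is_disguised_range(fixed.index))
```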
This issue should be resolved in azureml-dataprep 4.8.x.
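A quick way to see which azureml-dataprep version an environment has, sketched with the standard library only. The helper names installed_version and at_least are made up for this example, and the (4, 8) threshold is simply taken from the comment above, not from any release notes.

```python
from importlib import metadata
from itertools import takewhile


def installed_version(package="azureml-dataprep"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def at_least(version_string, minimum=(4, 8)):
    """Compare the leading numeric components of a version against a minimum."""
    parts = []
    for piece in version_string.split(".")[: len(minimum)]:
        # Keep only the leading digits so suffixes like "8rc1" do not break parsing.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits or 0))
    return tuple(parts) >= minimum


current = installed_version()
if current is None:
    print("azureml-dataprep is not installed in this environment")
else:
    print(current, "meets 4.8:", at_least(current))
```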