MachineLearningNotebooks AzureML TabularDataSet via parquet and pandas index error

AzureML TabularDataSet via parquet and pandas index error

Open vla6 opened this issue 4 years ago • 5 comments

Azure's TabularDataset implementation introduces an extra index column, __index_level_0__, when creating or reading Parquet files that were originally written by pandas. This occurs when the index is unnamed but has been modified at some point; if the index is named, we instead get an extra column with the same name as the index.

When making changes to datasets, this additional field causes Azure errors if not handled. Depending on what's been done to the index of the original dataset, you may or may not get that additional field.

I have an example notebook that can be run to reproduce the issue. It's here: https://github.com/vla6/Azure_notes/blob/main/tabulardataset_parquet_index_di_issue.ipynb

The notebook requires an Azure Machine Learning workspace and a storage account to run.

Jan 22 '21 19:01 vla6

At the very least, could we have more on valid indexing in docs?

Feb 06 '22 16:02 maciejskorski

Facing the same issue - is there an update yet?

Feb 28 '22 21:02 falconflightX

When creating datasets, I make sure to pass a data frame that has an Azure-friendly index.

For example, enforce a RangeIndex with df.reset_index(drop=True, inplace=True), or create a more sophisticated index with df.set_index(['patient_id', 'encounter_id'], inplace=True).
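A small sketch of both options, run on made-up data. The column names patient_id and encounter_id come from the comment above; the value column and all rows are hypothetical. It uses the non-inplace forms, which return a new frame rather than mutating the original.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3],
        "encounter_id": [10, 11, 20, 30],
        "value": [0.5, -1.0, 2.0, 3.5],
    }
)

# Filtering leaves gaps in the index, so it is no longer a RangeIndex.
subset = df[df["value"] > 0]

# Option 1: discard the old index and get a fresh 0..n-1 RangeIndex.
clean = subset.reset_index(drop=True)

# Option 2: promote meaningful columns to an explicit, named index.
keyed = subset.set_index(["patient_id", "encounter_id"])

print(type(subset.index).__name__)
print(type(clean.index).__name__)
print(list(keyed.index.names))
```

With option 2 the index levels are named, so per the original report they would be expected to appear as columns under those names rather than as __index_level_0__.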

Apr 26 '22 07:04 maciejskorski

I had the same issue. @maciejskorski's solution of resetting the index fixed it, but it is annoying that Azure couldn't figure this out itself. The index was already a sequential integer index with no gaps; it just happened to have a pandas Int64 datatype for some reason. So while it looked and behaved for all purposes like a regular RangeIndex, AzureML got confused.
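A rough way to spot this situation before uploading is to check the index type rather than its values. The sketch below is only an illustration: the helper name is hypothetical, and the sample frame simply builds an explicit int64 index that holds the same values a RangeIndex would.

```python
import numpy as np
import pandas as pd


def is_disguised_range_index(df: pd.DataFrame) -> bool:
    """True if the index holds 0..n-1 but is not an actual RangeIndex."""
    idx = df.index
    if isinstance(idx, pd.RangeIndex):
        return False
    return bool(np.array_equal(idx.to_numpy(), np.arange(len(idx))))


# Sequential values, but stored as a plain integer index.
df = pd.DataFrame({"x": [10, 20, 30]}, index=pd.Index([0, 1, 2], dtype="int64"))
print(is_disguised_range_index(df))

# Resetting the index swaps in a genuine RangeIndex.
fixed = df.reset_index(drop=True)
print(is_disguised_range_index(fixed))
```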

Jun 07 '22 06:06 pat-hearps

This issue should be resolved in azureml-dataprep 4.8.x.

Dec 14 '22 08:12 mobaniha
