RDT
Fitting with `numerical` column names fails
Environment Details
Please indicate the following details about the environment in which you found the bug:
- RDT version: 0.6.1
- Python version: 3.8
- Operating System: Ubuntu
Error Description
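Fitting any Transformer on a pd.DataFrame whose column names are numerical, for example the default RangeIndex or an explicit integer label, fails.
The bug surfaces as one of two errors, depending on how the columns are passed:
- Multiple columns (passing data.columns): TypeError: unhashable type: 'RangeIndex'
- Single column (passing 0): TypeError: sequence item 0: expected str instance, int found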
Steps to reproduce
Multiple columns
import pandas as pd
from rdt.transformers import OneHotEncoder

# No column names are given, so pandas assigns a RangeIndex (0, 1, 2).
data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])
ohe = OneHotEncoder()
ohe.fit(data, data.columns)
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-9be2b41b4858> in <module>
----> 1 ohe.fit(data, data.columns)
~/Projects/sdv-dev/RDT/rdt/transformers/base.py in fit(self, data, columns)
163 Column names. Must be present in the data.
164 """
--> 165 self._store_columns(columns, data)
166
167 columns_data = self._get_columns_data(data, self.columns)
~/Projects/sdv-dev/RDT/rdt/transformers/base.py in _store_columns(self, columns, data)
112 columns = [columns]
113
--> 114 missing = set(columns) - set(data.columns)
115 if missing:
116 raise KeyError(f'Columns {missing} were not present in the data.')
~/.virtualenvs/RDT/lib/python3.9/site-packages/pandas/core/indexes/base.py in __hash__(self)
4076
4077 def __hash__(self):
-> 4078 raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
4079
4080 def __setitem__(self, key, value):
TypeError: unhashable type: 'RangeIndex'
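A minimal sketch of why this call fails, assuming (from the `columns = [columns]` line visible in the traceback) that `_store_columns` wraps any non-list argument in a single-element list. A pandas Index is not hashable, so building a set from the wrapped list raises, while unpacking the Index into a plain list hashes the individual labels instead. This is an illustration using only pandas, not RDT's actual code.

```python
import pandas as pd

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f'],
])
columns = data.columns  # default RangeIndex with labels 0, 1, 2

# Assumption: mirrors `columns = [columns]` from the traceback, where a
# non-list argument ends up wrapped as a single element.
wrapped = [columns]
try:
    set(wrapped)  # hashing the Index object itself fails
    hash_error = None
except TypeError as error:
    hash_error = str(error)

# Unpacking the Index into a plain list hashes the labels instead.
missing = set(list(columns)) - set(data.columns)

print(hash_error)
print(missing)
```

With the labels unpacked, the missing-columns check has nothing to report, which suggests that accepting a pd.Index in addition to a list would be enough for this first error.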
Single column
ohe.fit(data, 0)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-32-5d4d0160e7be> in <module>
----> 1 ohe.fit(data, 0)
~/Projects/sdv-dev/RDT/rdt/transformers/base.py in fit(self, data, columns)
168 self._fit(columns_data)
169
--> 170 self._build_output_columns(data)
171
172 def _transform(self, columns_data):
~/Projects/sdv-dev/RDT/rdt/transformers/base.py in _build_output_columns(self, data)
136
137 def _build_output_columns(self, data):
--> 138 self.column_prefix = '#'.join(self.columns)
139 self.output_columns = list(self.get_output_types().keys())
140
TypeError: sequence item 0: expected str instance, int found
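The second failure can be reproduced without RDT or pandas. The sketch below assumes that `self.columns` holds `[0]` after `fit(data, 0)`, which is what the `'#'.join(self.columns)` line in the traceback implies. Casting each label to `str` before joining is one possible way around it, not necessarily the fix the maintainers would choose.

```python
# Assumption: stands in for self.columns after calling fit(data, 0).
columns = [0]

try:
    prefix = '#'.join(columns)  # str.join only accepts str items
    join_error = None
except TypeError as error:
    join_error = str(error)

# Casting each label to str first makes the join succeed.
safe_prefix = '#'.join(str(column) for column in columns)

print(join_error)
print(safe_prefix)
```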
Notes
These errors are raised in _store_columns when multiple columns are passed and in _build_output_columns when a single column is passed.
I can confirm that this issue still persists in RDT 1.0:
import pandas as pd
from rdt import HyperTransformer
data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])
ht = HyperTransformer()
ht.detect_initial_config(data)
ht.fit_transform(data)
Output:
TypeError: sequence item 0: expected str instance, int found
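Until this is fixed, a possible workaround is to cast the column labels to strings before handing the DataFrame to RDT. The sketch below uses only pandas. It assumes that string labels avoid both failing lines shown in the tracebacks (the set difference and the `'#'.join`), and it has not been checked against every RDT release.

```python
import pandas as pd

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f'],
])

# Cast the integer labels 0, 1, 2 to the strings '0', '1', '2'.
data.columns = data.columns.astype(str)

# A list of string labels is hashable element by element and joinable.
columns = list(data.columns)
missing = set(columns) - set(data.columns)
prefix = '#'.join(columns)

print(columns)
print(prefix)
```

Passing `list(data.columns)` rather than `data.columns` itself also keeps the argument a plain list, which is the type the traceback suggests `fit` handles without wrapping.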