[SPIKE] Investigate whether Woodwork can be expanded to handle incoming `string` dtypes
Currently, if a user creates a pandas DataFrame and passes it into Woodwork, pandas has already inferred certain dtypes, which makes Woodwork's inference significantly easier. However, there may be cases where all incoming data is text and every column has a dtype of `string`.
For a dataframe initialized like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df["ints"] = [i for i in range(100)]
df["floats"] = [i*1.1 for i in range(100)]
df["bools"] = [True, False, False, True, False] * 20
df["bools_nan"] = [True, False, False, True, pd.NA] * 20
df["strings"] = [f"{i}" for i in range(100)]
df["categoricals"] = np.random.choice(["Yellow", "Blue", "Red"], 100)
Subsequent Woodwork initialization yields the expected types:
But conversion of all dtypes to `string` prior to Woodwork initialization:
for col in df.columns:
    df[col] = df[col].astype("string")
Yields this:
This spike covers investigating what solutions exist for this and how, and in what order, the work should be tackled: one logical type at a time, or with a single approach that handles all types at once.
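As a starting point for discussion, here is a minimal sketch of the "all at once" option: try progressively looser parsers on the raw string values and fall back to a cardinality check. This is not Woodwork's inference code. The function name `infer_kind`, the labels it returns, and the `categorical_threshold` default are all hypothetical and chosen only for illustration. It uses plain Python lists (with `None` for missing values) so it runs without any dependencies.

```python
# Hypothetical helper for illustration only; not part of Woodwork's API.
BOOL_TOKENS = {"true", "false"}


def _all_parse(values, parser):
    """Return True if `parser` accepts every value without raising ValueError."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_kind(values, categorical_threshold=0.2):
    """Guess a coarse type label for a column of strings (None means missing).

    Checks run from most to least specific: boolean, integer, double,
    then a unique-value ratio test for categorical. The threshold is an
    assumed value, not one taken from Woodwork.
    """
    present = [v.strip() for v in values if v is not None]
    if not present:
        return "unknown"
    if all(v.lower() in BOOL_TOKENS for v in present):
        return "boolean"
    # Note: int() and float() also accept forms such as "1_000" or "nan",
    # so a real implementation would likely need stricter parsing.
    if _all_parse(present, int):
        return "integer"
    if _all_parse(present, float):
        return "double"
    if len(set(present)) / len(present) <= categorical_threshold:
        return "categorical"
    return "unknown"


ints = [str(i) for i in range(100)]
floats = [str(i * 1.1) for i in range(100)]
bools_nan = ["True", "False", "False", "True", None] * 20
colors = ["Yellow", "Blue", "Red"] * 30

print(infer_kind(ints))       # integer
print(infer_kind(floats))     # double
print(infer_kind(bools_nan))  # boolean
print(infer_kind(colors))     # categorical
```

One thing this makes visible: the `strings` column in the example above holds digit-only text, so any value-based approach like this one would label it as an integer column. Whether that is the desired outcome is one of the questions the spike should answer.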