[SPIKE] Investigate whether Woodwork can be expanded to handle incoming `string` dtypes

Open ParthivNaresh opened this issue 2 years ago • 0 comments

Currently, when a user creates a pandas DataFrame and passes it to Woodwork, pandas has already inferred certain dtypes, which makes Woodwork's inference significantly easier. However, there may be cases where all incoming data is text and every column has a dtype of `string`.

For a dataframe initialized like this:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df["ints"] = list(range(100))
df["floats"] = [i * 1.1 for i in range(100)]
df["bools"] = [True, False, False, True, False] * 20
df["bools_nan"] = [True, False, False, True, pd.NA] * 20
df["strings"] = [f"{i}" for i in range(100)]
df["categoricals"] = np.random.choice(["Yellow", "Blue", "Red"], 100)

Subsequent Woodwork initialization yields the expected result: (screenshot: Screen Shot 2023-01-13 at 4 03 12 PM)

But converting every column to the `string` dtype prior to Woodwork initialization

for col in df.columns:
    df[col] = df[col].astype("string")

Yields this: (screenshot: Screen Shot 2023-01-13 at 4 03 21 PM)
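To make the effect of the cast concrete, here is a minimal pandas-only sketch. It uses a smaller frame than the one above and does not call Woodwork; it only shows that the dtype information pandas had inferred is gone after `astype("string")`, so anything downstream sees text in every column.

```python
import pandas as pd

# A reduced version of the frame above (5 rows instead of 100).
df = pd.DataFrame(
    {
        "ints": list(range(5)),
        "floats": [i * 1.1 for i in range(5)],
        "bools": [True, False, False, True, False],
    }
)

# dtypes as inferred by pandas at construction time
before = df.dtypes.astype(str).to_dict()

# Same cast as in the issue, applied to the whole frame at once
as_strings = df.astype("string")
after = as_strings.dtypes.astype(str).to_dict()

print(before)
print(after)
```

After the cast, the boolean column holds the literal text `"True"` and `"False"`, and the numeric columns hold their printed representations.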

This spike covers investigating what solutions exist for this and how, and in what order, the problem should be tackled: one logical type at a time, or with an approach that handles all types at once.
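As one possible direction for the "all at once" option, a pre-pass could try to recover a richer pandas dtype from each `string` column before Woodwork sees it. The sketch below is only an illustration under stated assumptions: `coerce_string_column` is a hypothetical helper, not part of Woodwork, and it only handles booleans written as `"True"`/`"False"` plus integers and floats. Everything else, including categorical-looking columns, is left as `string`.

```python
import pandas as pd


def coerce_string_column(series: pd.Series) -> pd.Series:
    """Hypothetical helper: try to recover a richer dtype from a text column."""
    series = series.astype("string")
    non_null = series.dropna()
    if non_null.empty:
        return series

    # Booleans: every non-null value is the literal text "True" or "False".
    if non_null.isin(["True", "False"]).all():
        # Comparison on a string-dtype column keeps missing values as <NA>.
        return (series == "True").astype("boolean")

    # Numbers: parse only the non-null values; unparseable text becomes NaN.
    numeric = pd.to_numeric(non_null.astype(object), errors="coerce")
    if numeric.isna().any():
        # At least one value is not a number, so keep the column as text.
        return series

    # Use nullable dtypes so missing rows can be restored via reindex.
    target = "Int64" if pd.api.types.is_integer_dtype(numeric) else "Float64"
    return numeric.astype(target).reindex(series.index)


raw = pd.DataFrame(
    {
        "ints": ["0", "1", "2"],
        "floats": ["0.0", "1.1", "2.2"],
        "bools_nan": ["True", "False", pd.NA],
        "colors": ["Yellow", "Blue", "Red"],
    }
).astype("string")

# Build the result column by column so each recovered dtype is preserved.
recovered = pd.DataFrame({col: coerce_string_column(raw[col]) for col in raw.columns})
print(recovered.dtypes)
```

The order of the checks matters: booleans are tested before numbers, and a column is only converted when every non-null value parses. A per-logical-type approach would instead extend each inference function separately, which is the trade-off this spike is meant to evaluate.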

ParthivNaresh, Jan 13 '23 21:01