evalml Spike: Preserve woodwork schema through dask

Open freddyaboulton opened this issue 3 years ago • 2 comments

In #2243, we added X_schema and y_schema arguments to train_pipeline and score_pipeline because the dask engine loses the schema when it pickles the dataframe and sends it to the workers.

The underlying problem is that pickling a dataframe does not preserve its woodwork schema. This looks like a general limitation of pandas accessors rather than something specific to woodwork.

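import pickle

import pandas as pd
import woodwork as ww  # importing woodwork registers the .ww accessor

df = pd.DataFrame({"a": [1, 2, 3], "id": [0, 1, 2]}, index=[4, 5, 6])
df.ww.init()
# The schema is set after initialization.
assert df.ww.schema is not None

# After a pickle round trip, the schema is gone.
assert pickle.loads(pickle.dumps(df)).ww.schema is None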
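
To check whether this is accessor-general rather than woodwork-specific, here is a minimal sketch using only pandas. The `demo` accessor and its `schema` attribute are made up for illustration and are not part of woodwork or pandas. State stored on the accessor object is not part of the dataframe's pickled state, so the unpickled frame gets a fresh accessor.

```python
import pickle

import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Hypothetical accessor that keeps some state, loosely like df.ww does."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        self.schema = None

    def init(self):
        # Store per-column metadata on the accessor instance only.
        self.schema = {col: str(dtype) for col, dtype in self._obj.dtypes.items()}


df = pd.DataFrame({"a": [1, 2, 3]})
df.demo.init()
schema_before = df.demo.schema

# The round-tripped frame builds a new accessor, so the stored state is lost.
schema_after = pickle.loads(pickle.dumps(df)).demo.schema
```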

This issue tracks figuring out how to make dask serialize woodwork data structures. Once we come up with a solution, we can discuss whether it should live in evalml or woodwork.

Some resources to check out:

May 11 '21 15:05 freddyaboulton

For whoever picks this up: we need to confirm why we can't use woodwork with dask right now, and write up an explanation of this. Then if we need to file any woodwork requests we should do so.

May 14 '21 17:05 dsherry

Some related discussion on pickling pandas accessors: https://github.com/pandas-dev/pandas/issues/32678 cc @gsheni @rwedge

Sep 29 '21 17:09 kmax12
