evalml
Spike: Preserve woodwork schema through dask
In #2243, we added `X_schema` and `y_schema` arguments to `train_pipeline` and `score_pipeline` because the dask engine would lose the schema once it pickles the dataframe and sends it to the workers.
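The pattern from #2243 can be sketched without dask or woodwork. This is only a toy model: `send_to_worker` and `score_on_worker` are hypothetical names, plain dicts stand in for the dataframe and its schema, and the pickle round trip is an assumption about what the engine does to its arguments.

```python
import pickle


def send_to_worker(*args):
    """Simulate an engine that pickles every argument before a worker sees it."""
    return pickle.loads(pickle.dumps(args))


def score_on_worker(X, X_schema):
    # Hypothetical worker-side task: typing information is read from the explicit
    # X_schema argument, never from state attached to X.
    # With woodwork, this is presumably where X.ww.init(schema=X_schema) would go.
    features = [name for name, role in X_schema.items() if role != "index"]
    return {name: sum(X[name]) for name in features}


X = {"a": [1, 2, 3], "id": [0, 1, 2]}
X_schema = {"a": "numeric", "id": "index"}  # toy schema, not a woodwork TableSchema

# Both objects cross the pickle boundary as ordinary arguments.
result = score_on_worker(*send_to_worker(X, X_schema))
```

Because the schema travels as its own argument, it survives whatever serialization the engine applies to the data. The cost is that every function sent to a worker needs the extra parameters.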
The underlying problem is that `pickle` does not preserve the woodwork schema. This looks like a general limitation of pandas accessors and is probably not specific to woodwork.
import pandas as pd
import woodwork as ww
import pickle
df = pd.DataFrame({"a": [1, 2, 3], 'id': [0, 1, 2]}, index=[4, 5, 6])
df.ww.init()
assert df.ww.schema is not None
assert pickle.loads(pickle.dumps(df)).ww.schema is None
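To illustrate why this may be a general accessor issue rather than something woodwork does wrong, here is a minimal pure-Python sketch. `Table` and `SchemaAccessor` are hypothetical stand-ins, not pandas or woodwork classes. The key assumption, which still needs to be confirmed against pandas itself, is that the accessor's state lives in a per-instance cache that is not part of the object's pickled state.

```python
import pickle


class SchemaAccessor:
    """Hypothetical stand-in for an accessor such as `df.ww`."""

    def __init__(self, table):
        self._table = table
        self.schema = None  # nothing is known until init() is called

    def init(self):
        # Record a trivial "schema": the type name of each column's first value.
        self.schema = {
            name: type(values[0]).__name__
            for name, values in self._table.data.items()
        }


class Table:
    """Hypothetical stand-in for a DataFrame whose pickled state covers only its data."""

    def __init__(self, data):
        self.data = data

    @property
    def ww(self):
        # Build the accessor lazily and cache it on the instance.
        if "_ww_cache" not in self.__dict__:
            self.__dict__["_ww_cache"] = SchemaAccessor(self)
        return self.__dict__["_ww_cache"]

    def __getstate__(self):
        # Assumption: only the data is serialized; the accessor cache is left out.
        return {"data": self.data}

    def __setstate__(self, state):
        self.data = state["data"]


table = Table({"a": [1, 2, 3], "id": [0, 1, 2]})
table.ww.init()

# The data survives the round trip, but a fresh, uninitialized accessor is
# created on first access to `restored.ww`.
restored = pickle.loads(pickle.dumps(table))
```

In this model, `restored.data` matches the original while `restored.ww.schema` is `None`, which mirrors the woodwork reproduction. If pandas behaves the same way, any fix has to either add the schema to the serialized state or carry it next to the dataframe.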
This issue tracks figuring out how to make dask serialize woodwork data structures. Once we come up with a solution, we can discuss whether it should live in evalml or woodwork.
Some resources to check out:
For whoever picks this up: we need to confirm why we can't use woodwork with dask right now and write up an explanation. Then, if we need to file any woodwork requests, we should do so.
Some related discussion on pickling pandas accessors: https://github.com/pandas-dev/pandas/issues/32678 (cc @gsheni @rwedge)