BoostARoota
Dask integration
Much like your idea for pyspark integration, I would like to see similar support for passing in a dask client, as is supported by the dask-xgboost library. I have found initial success in reducing high-dimensional data using the BoostARoota library, but find the bottleneck to be during the initial load of the parquet file repository. I'll offer what assistance I can regarding this work.
Ben.
I have never used dask before, but have been wanting to look into it. This gives me a reason to! I'll start looking into setup and usage, but might reach back out to you for assistance. Feel free to send me an email: chasedehan at yahoo dot com
Just an update: I have gotten dask and dask-xgboost working on my local machine and on a cluster, but will need to do some work on the shadow feature creation. I thought I would be able to just drop in dxgb.train() along with Client(), but I am doing all the feature work under the hood with pandas. The dask dataframe is slightly different; it doesn't look too hard, but it might take me a few days to work out how it will fit in with the rest of the package. (I really want to avoid bloat on the main functionality.)
For example, this is one of the helper functions I need to rework:
import numpy as np
import pandas as pd

def _create_shadow(x_train):
    # copy the original features and shuffle each column independently
    x_shadow = x_train.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)
    # rename the shadow features
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # combine originals and shadows into one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names
Hello, is there any update on this feature? It would be great, as it would speed up processing even more.