BoostARoota
Dask integration
Much like your idea for pyspark integration, I would like to see similar support for passing in a dask client, as is supported by the dask-xgboost library. I have found initial success in reducing high-dimensional data using the BoostARoota library, but find the bottleneck to be during the initial load of the parquet file repository. I'll offer what assistance I can regarding this work.
Ben.
I have never used dask before, but have been wanting to look into it. This gives me a reason to! I'll start looking into setup and usage, but might reach back out to you for assistance. Feel free to send me an email: chasedehan at yahoo dot com
Just an update: I have gotten dask and dask-xgboost working on my local machine and on a cluster, but will need to do some work on the shadow feature creation. I thought I would be able to just drop in dxgb.train() along with Client(), but I am doing all the feature work under the hood with pandas. The dask dataframe is slightly different; it doesn't look too hard, but it might take me a few days to work out how it will fit in with the rest of the package. (I really want to avoid bloat on the main functionality.)
For example, this is one of the helper functions I need to rework:
import numpy as np
import pandas as pd

def _create_shadow(x_train):
    # copy the original features and shuffle each column independently
    x_shadow = x_train.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)
    # rename the shadow features
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # combine originals and shadows into one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names
Hello, is there any update on this feature? It would be great, as it would speed up processing even more.