sklearn StandardScaler vs dask StandardScaler.

Open Arunes007 opened this issue 2 years ago • 1 comments

I am getting different results from sklearn StandardScaler and dask StandardScaler.

import sklearn.preprocessing
import dask_ml.preprocessing

scaler_sk = sklearn.preprocessing.StandardScaler()
scaler_d = dask_ml.preprocessing.StandardScaler()

scaler_sk.fit(df_pd[["SUMMESSAGECOUNT"]])
scaler_d.fit(df_dask[["SUMMESSAGECOUNT"]])

Dask scaler

scaler_d.mean_[0], scaler_d.var_[0]
output: (19.157653421114507, 47431.17794342375)

Sklearn Scaler

scaler_sk.mean_[0], scaler_sk.var_[0]
output: (19.157653421114507, 47431.17794342373)

I know the difference is negligible, but it is influencing my model training with Prophet. Could you please suggest a way to make the two results identical without using compute()?

Arunes007 avatar Dec 01 '23 11:12 Arunes007

I think that floating-point inaccuracies are just a fact of life when you’re doing things in chunks, at least with the algorithms that dask.array uses today. I don’t think there’s anything we can do in dask-ml to address that (but maybe check the source to be sure).

TomAugspurger avatar Dec 02 '23 13:12 TomAugspurger
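
The reply above attributes the mismatch to chunked floating-point arithmetic. The sketch below illustrates that effect in plain Python. It is not dask's or scikit-learn's actual implementation: the helper names `moments` and `combine`, the synthetic data, and the chunk size are all assumptions made for illustration. It computes a population variance once over the whole list and once by merging per-chunk statistics with the standard pairwise update formula, then compares the two with a tolerance rather than with `==`.

```python
import math
import random


def moments(values):
    """Return (count, mean, sum of squared deviations) for one block of data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def combine(a, b):
    """Merge the statistics of two blocks (pairwise update of mean and M2)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Synthetic, skewed data standing in for a column of counts (an assumption).
random.seed(0)
data = [random.expovariate(1 / 19.0) for _ in range(10_000)]

# Whole-array computation.
n_full, mean_full, m2_full = moments(data)
var_full = m2_full / n_full

# Chunked computation: same data, different order of floating-point operations.
chunk_size = 1_000  # hypothetical chunk size
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
acc = moments(chunks[0])
for chunk in chunks[1:]:
    acc = combine(acc, moments(chunk))
n_chunked, mean_chunked, m2_chunked = acc
var_chunked = m2_chunked / n_chunked

# Floating-point addition is not associative, so grouping changes the last bits.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

print(var_full, var_chunked)
# The two variances may differ in the last digits, so compare with a tolerance.
print(math.isclose(var_full, var_chunked, rel_tol=1e-9))
```

Both computations are mathematically equivalent, so any difference between `var_full` and `var_chunked` comes only from the order of operations. If downstream code is sensitive to the last few digits, comparing with a tolerance (or rounding both values to a fixed number of significant digits before use) is one way to treat the two results as equal.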