scikit-lego icon indicating copy to clipboard operation
scikit-lego copied to clipboard

preprocessing.RepeatingBasisFunction encodes different labels as the same[BUG]

Open yug030 opened this issue 1 year ago • 2 comments

Hi,

I'm working on a task to encode timestamps (more specifically, weekdays) and I noticed RepeatingBasisFunction does not work correctly with weekday=0 and weekday=6: the outputs for both are the same.

Here's the code snippet to reproduce the bug:

import pandas as pd
from sklego.preprocessing import RepeatingBasisFunction

start_date = '2023-06-25 00:00:00'
end_date = '2023-07-03 23:59:59'
timestamps = pd.date_range(start=start_date, end=end_date, freq='1D')
df = pd.DataFrame({'Timestamps': timestamps})
df['wkday'] = df['Timestamps'].dt.weekday
print(df)
rbf = RepeatingBasisFunction(n_periods=4, column="wkday", input_range=(0,6), remainder="drop")
rbf.fit(df)
print(rbf.transform(df))

The output for me is:

  Timestamps  wkday
0 2023-06-25      6
1 2023-06-26      0
2 2023-06-27      1
3 2023-06-28      2
4 2023-06-29      3
5 2023-06-30      4
6 2023-07-01      5
7 2023-07-02      6
8 2023-07-03      0

array([[1.        , 0.36787944, 0.01831564, 0.36787944],
       [1.        , 0.36787944, 0.01831564, 0.36787944],
       [0.64118039, 0.89483932, 0.16901332, 0.06217652],
       [0.16901332, 0.89483932, 0.64118039, 0.06217652],
       [0.01831564, 0.36787944, 1.        , 0.36787944],
       [0.16901332, 0.06217652, 0.64118039, 0.89483932],
       [0.64118039, 0.06217652, 0.16901332, 0.89483932],
       [1.        , 0.36787944, 0.01831564, 0.36787944],
       [1.        , 0.36787944, 0.01831564, 0.36787944]])

Update: If I add 1 to "wkday" column (so it's 1 to 7 now) and use rbf = RepeatingBasisFunction(n_periods=7, column="wkday", input_range=(0,7), remainder="drop") and call fit and transform, I will get a square matrix with 1.0 on the diagonal which looks correct, but it's different from the documentation.

yug030 avatar Aug 03 '23 07:08 yug030

Hi, thank you for reporting the issue and sorry for the late reply!

This seems to be unexpected/buggy behavior. My understanding is that some assumptions have been made in the implementation and it may be worth to adjust or clearly state them.

I will take the time to have a closer look at it.

FBruzzesi avatar Nov 15 '23 07:11 FBruzzesi