scikit-lego
scikit-lego copied to clipboard
preprocessing.RepeatingBasisFunction encodes different labels as the same[BUG]
Hi,
I'm working on a task to encode timestamps (more specifically, weekdays) and I noticed RepeatingBasisFunction does not work correctly with weekday=0 and weekday=6: the outputs for both are the same.
Here's the code snippet to reproduce the bug:
import pandas as pd
from sklego.preprocessing import RepeatingBasisFunction
start_date = '2023-06-25 00:00:00'
end_date = '2023-07-03 23:59:59'
timestamps = pd.date_range(start=start_date, end=end_date, freq='1D')
df = pd.DataFrame({'Timestamps': timestamps})
df['wkday'] = df['Timestamps'].dt.weekday
print(df)
rbf = RepeatingBasisFunction(n_periods=4, column="wkday", input_range=(0,6), remainder="drop")
rbf.fit(df)
print(rbf.transform(df))
The output for me is:
Timestamps wkday
0 2023-06-25 6
1 2023-06-26 0
2 2023-06-27 1
3 2023-06-28 2
4 2023-06-29 3
5 2023-06-30 4
6 2023-07-01 5
7 2023-07-02 6
8 2023-07-03 0
array([[1. , 0.36787944, 0.01831564, 0.36787944],
[1. , 0.36787944, 0.01831564, 0.36787944],
[0.64118039, 0.89483932, 0.16901332, 0.06217652],
[0.16901332, 0.89483932, 0.64118039, 0.06217652],
[0.01831564, 0.36787944, 1. , 0.36787944],
[0.16901332, 0.06217652, 0.64118039, 0.89483932],
[0.64118039, 0.06217652, 0.16901332, 0.89483932],
[1. , 0.36787944, 0.01831564, 0.36787944],
[1. , 0.36787944, 0.01831564, 0.36787944]])
Update:
If I add 1 to "wkday" column (so it's 1 to 7 now) and use rbf = RepeatingBasisFunction(n_periods=7, column="wkday", input_range=(0,7), remainder="drop")
and call fit
and transform
, I will get a square matrix with 1.0 on the diagonal which looks correct, but it's different from the documentation.
Hi, thank you for reporting the issue and sorry for the late reply!
This seems to be unexpected/buggy behavior. My understanding is that some assumptions have been made in the implementation and it may be worth to adjust or clearly state them.
I will take the time to have a closer look at it.