BigDL-2.x
Chronos TSDataset: various enhancements
- [x] Enable customized feature generation in `gen_dt_feature`.
- [x] API doc for TSDataset.
- [ ] Onehotencoder operation.
- [x] `_unscale_numpy` -> `unscale_numpy`, with API doc.
- [x] `resample` needs default `start_time` and `end_time`.
Add a `YEAR` feature in `gen_dt_feature`.
Remove the quote in result column names of `gen_dt_feature`, e.g. `MONTH(StartTime)` -> `MONTH`.
When `non_pd_datetime` appears, `impute("linear")` raises the following error: `Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype`

    def get_multi_id_ts_df():
        return train_df.astype('object')

    tsdata = TSDataset.from_pandas(df, target_col="value", dt_col="datetime", extra_feature_col=['extra feature'])
    tsdata.impute("linear")

    df = pd.DataFrame({"datetime": np.arange(100),
                       "id": np.array(['00']*100),
                       "value": np.random.randn(100),
                       "extra feature": np.random.randn(100)})

non_pd_datetime
- `tsdata.resample('2D')`: `AttributeError: unsupported operand type(s) for -: 'numpy.float64' and 'Timestamp'`
- `tsdata.gen_dt_feature()`: `AttributeError: Can only use .dt accessor with datetimelike values`
- `tsdata.gen_rolling_feature(window_size=10)`: `IndexError: single positional indexer is out-of-bounds` (appears when `window_size` is too large)
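The first two errors above come from a datetime column that is not a pandas datetime type. As a minimal sketch of how such input could be rejected early, the snippet below uses plain pandas only. `check_dt_col` is a hypothetical helper name, not part of the TSDataset API, and whether TSDataset should raise or convert in this case is left open.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_dt_col(df: pd.DataFrame, dt_col: str) -> None:
    """Hypothetical helper: fail early if dt_col is not a pandas datetime column."""
    if not is_datetime64_any_dtype(df[dt_col]):
        raise ValueError(
            f"Column '{dt_col}' has dtype {df[dt_col].dtype}; "
            "expected a pandas datetime dtype (convert it with pd.to_datetime)."
        )


# Same idea as the non_pd_datetime frame: integers in the datetime column.
bad_df = pd.DataFrame({"datetime": np.arange(100),
                       "value": np.random.randn(100)})
# A frame with a real datetime column passes the check.
good_df = pd.DataFrame({"datetime": pd.date_range("2019-01-01", periods=100, freq="D"),
                        "value": np.random.randn(100)})

check_dt_col(good_df, "datetime")  # no error
try:
    check_dt_col(bad_df, "datetime")
    rejected = False
except ValueError:
    rejected = True
print("non-datetime column rejected:", rejected)
```

A check like this would turn the low-level `.dt` accessor error into a message that names the offending column.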
not_aligned
    def not_aligned():
        df_val = pd.DataFrame({"id": np.array(['00']*20 + ['01']*30 + ['02']*50),
                               "value": np.random.randn(100),
                               "extra feature": np.random.randn(100)})
        data_sec = pd.DataFrame({"datetime": pd.date_range(start='1/1/2019 00:00:00', periods=20, freq='S')})
        data_min = pd.DataFrame({"datetime": pd.date_range(start='1/2/2019 00:00:00', periods=30, freq='H')})
        data_hou = pd.DataFrame({"datetime": pd.date_range(start='1/3/2019 00:00:00', periods=50, freq='D')})
        dt_val = pd.concat([data_sec, data_min, data_hou], axis=0, ignore_index=True)
        df = pd.merge(left=dt_val, right=df_val, left_index=True, right_index=True)
        return df

- `tsdata.resample('2D').roll(lookback=5, horizon=2, id_sensitive=True).to_numpy()`: `ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 2 has 3 dimension(s)`
- `tsdata.gen_global_feature()`: same as `resample`
- `gen_rolling_feature(window_size=30)`: `IndexError: single positional indexer is out-of-bounds` (appears when `window_size` is too large)
- `tsdata.roll(lookback=5, horizon=2, id_sensitive=True)`: `numpy.AxisError: axis 2 is out of bounds for array of dimension 1.`
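To make the `not_aligned` case easier to spot before calling `roll` or `resample`, a per-id summary can be computed with plain pandas. This is only a sketch: `summarize_ids` is a hypothetical helper, not an existing TSDataset method, and the small frame below is a stand-in for `not_aligned()` with the same three frequencies.

```python
import pandas as pd


def summarize_ids(df: pd.DataFrame, id_col: str, dt_col: str) -> pd.DataFrame:
    """Hypothetical helper: per-id row count, time span and inferred frequency."""
    rows = []
    for key, group in df.groupby(id_col):
        stamps = pd.DatetimeIndex(group[dt_col].sort_values())
        rows.append({
            id_col: key,
            "rows": len(stamps),
            "start": stamps.min(),
            "end": stamps.max(),
            # infer_freq needs at least 3 timestamps and returns None if irregular
            "freq": pd.infer_freq(stamps) if len(stamps) >= 3 else None,
        })
    return pd.DataFrame(rows)


# Stand-in for not_aligned(): three ids with different lengths and frequencies.
df = pd.concat([
    pd.DataFrame({"id": "00", "datetime": pd.date_range("2019-01-01", periods=20, freq="s")}),
    pd.DataFrame({"id": "01", "datetime": pd.date_range("2019-01-02", periods=30, freq="h")}),
    pd.DataFrame({"id": "02", "datetime": pd.date_range("2019-01-03", periods=50, freq="D")}),
], ignore_index=True)

summary = summarize_ids(df, "id", "datetime")
# Treat the ids as aligned only if they share one frequency and one length.
aligned = summary["freq"].nunique() == 1 and summary["rows"].nunique() == 1
print(summary)
print("aligned:", aligned)
```

Here the three ids differ in both length and frequency, so `aligned` is `False`.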
Calling `scale(scaler, fit=False)` multiple times should behave like calling it only once, since `scale` is effective only once when `fit=True`.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from bigdl.chronos.data import TSDataset  # import path assumed for BigDL 2.x

    df = pd.DataFrame({"datetime": np.array(['1/1/2019', '1/2/2019']),
                       "value": np.array([1, 2])})
    df_test = pd.DataFrame({"datetime": np.array(['1/3/2019', '1/4/2019']),
                            "value": np.array([1, 2])})
    tsdata = TSDataset.from_pandas(df,
                                   dt_col="datetime",
                                   target_col="value")
    tsdata_test = TSDataset.from_pandas(df_test,
                                        dt_col="datetime",
                                        target_col="value")
    standard_scaler = StandardScaler()
    tsdata.scale(standard_scaler, fit=True)
    # scale the test set twice with the already fitted scaler
    tsdata_test.scale(standard_scaler, fit=False).scale(standard_scaler, fit=False)
    print(tsdata_test.df)

The expected output `value` column is `[-1, 1]`; currently it is `[-5, -1]`.
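The two results can be reproduced by hand. The sketch below uses plain NumPy as a stand-in for `StandardScaler` (mean and population standard deviation learned from the training values) and does not touch TSDataset. It shows that `[-5, -1]` is what you get when the same transform is applied twice.

```python
import numpy as np

train = np.array([1.0, 2.0])
test = np.array([1.0, 2.0])

# Statistics learned once from the training data
# (population std, ddof=0, which is what StandardScaler uses).
mean, std = train.mean(), train.std()  # 1.5 and 0.5


def transform(x):
    """Stand-in for scaler.transform with the fitted statistics."""
    return (x - mean) / std


once = transform(test)              # [-1.  1.]  expected result
twice = transform(transform(test))  # [-5. -1.]  currently observed result
print(once)
print(twice)
```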
Testing random sequences of `tsdata` calls produces the following three types of errors (using `get_multip_df`):
- `gen_global_feature().gen_rolling_feature()` / `gen_global_feature().gen_global_feature()`
  - `Dict keys are not allowed to contain '__': extra feature__variance_larger_than_standard_deviation`
- `gen_dt_feature().gen_global_feature()`
  - ``numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.``
- `scale(fit=False)`
  - not fitted.
In `utils/feature.py`, function `_is_weekend()`: the line `return (weekday >= 5).values` should be changed to `return (weekday >= 5).astype(int).values`.
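A small illustration of why the cast matters, using only pandas and NumPy: subtracting boolean arrays raises the `numpy boolean subtract` error quoted above, while the integer version subtracts normally. The `weekday` series here is built for the example and only assumed to resemble what `_is_weekend()` receives.

```python
import pandas as pd

# Friday, Saturday, Sunday, Monday
dates = pd.Series(pd.date_range("2019-01-04", periods=4, freq="D"))
weekday = dates.dt.weekday  # 4, 5, 6, 0

as_bool = (weekday >= 5).values             # current behaviour: boolean array
as_int = (weekday >= 5).astype(int).values  # proposed behaviour: 0/1 integers

try:
    as_bool[1:] - as_bool[:-1]  # NumPy rejects `-` on boolean arrays
    bool_subtract_ok = True
except TypeError as err:
    bool_subtract_ok = False
    print("boolean subtract failed:", err)

int_diff = as_int[1:] - as_int[:-1]
print(int_diff)  # [ 1  0 -1]
```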