
Chronos TSDataset: various enhancements

Open TheaperDeng opened this issue 3 years ago • 7 comments

  • [x] Enable customized feature generation in gen_dt_feature.
  • [x] API doc for TSDataset.
  • [ ] Onehotencoder operation.
  • [x] _unscale_numpy -> unscale_numpy with api doc
  • [x] resample needs default start_time and end_time

TheaperDeng avatar Jun 11 '21 08:06 TheaperDeng

Add a YEAR feature to gen_dt_feature.

cabuliwallah avatar Jun 23 '21 06:06 cabuliwallah

Remove the parenthesized column reference from the result column names of gen_dt_feature, e.g. MONTH(StartTime) -> MONTH.
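A minimal sketch of the requested renaming, assuming the generated names follow the FEATURE(dt_col) pattern shown in the example. `strip_dt_suffix` is a hypothetical helper for illustration, not part of the Chronos API:

```python
import re

def strip_dt_suffix(column_name: str) -> str:
    """Drop a trailing '(...)' group, e.g. 'MONTH(StartTime)' -> 'MONTH'."""
    return re.sub(r"\([^()]*\)$", "", column_name)

# Hypothetical column names; only the first two carry the suffix.
columns = ["MONTH(StartTime)", "HOUR(StartTime)", "value"]
renamed = [strip_dt_suffix(c) for c in columns]
print(renamed)
```

Columns without a trailing parenthesized group pass through unchanged, so the helper can be applied to every column name.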

cabuliwallah avatar Jun 24 '21 06:06 cabuliwallah

When non_pd_datetime appears, impute("linear") raises the following error: Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype

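def get_multi_id_ts_df():
    # train_df is defined elsewhere (not shown in this comment); cast every column to object dtype
    return train_df.astype('object')

df = get_multi_id_ts_df()
tsdata = TSDataset.from_pandas(df, target_col="value", dt_col="datetime",
                               extra_feature_col=['extra feature'])
tsdata.impute("linear")  # raises the error quoted above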
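One possible workaround on the caller's side, sketched with plain pandas rather than TSDataset: convert the object-dtype column back to a numeric dtype before interpolating. The series below is made-up illustration data, and this is not necessarily how impute should be fixed internally.

```python
import pandas as pd

# Object-dtype column holding numeric values with a gap, as after astype('object').
values = pd.Series([1.0, float("nan"), 3.0], dtype="object")

# Convert back to a numeric dtype first, then interpolate linearly.
numeric = pd.to_numeric(values)
filled = numeric.interpolate(method="linear")
print(filled.tolist())
```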

liangs6212 avatar Jun 29 '21 01:06 liangs6212

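import numpy as np
import pandas as pd

# The datetime column holds plain integers rather than pandas datetimes.
df = pd.DataFrame({"datetime": np.arange(100),
                   "id": np.array(['00']*100),
                   "value": np.random.randn(100),
                   "extra feature": np.random.randn(100)})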

non_pd_datetime

  • tsdata.resample('2D') AttributeError: unsupported operand type(s) for -: 'numpy.float64' and 'Timestamp'
  • tsdata.gen_dt_feature() AttributeError: Can only use .dt accessor with datetimelike values
  • tsdata.gen_rolling_feature(window_size=10) IndexError: single positional indexer is out-of-bounds (appears when window_size is too large)

not_aligned

def not_aligned():
    # Three ids with different lengths (20, 30 and 50 rows)
    df_val = pd.DataFrame({"id": np.array(['00']*20 + ['01']*30 + ['02']*50),
                           "value": np.random.randn(100),
                           "extra feature": np.random.randn(100)})
    # Each id uses a different sampling frequency: secondly, hourly, daily
    data_sec = pd.DataFrame({"datetime": pd.date_range(start='1/1/2019 00:00:00', periods=20, freq='S')})
    data_min = pd.DataFrame({"datetime": pd.date_range(start='1/2/2019 00:00:00', periods=30, freq='H')})
    data_hou = pd.DataFrame({"datetime": pd.date_range(start='1/3/2019 00:00:00', periods=50, freq='D')})
    # Stack the three datetime ranges and join them to the values by row index
    dt_val = pd.concat([data_sec, data_min, data_hou], axis=0, ignore_index=True)
    df = pd.merge(left=dt_val, right=df_val, left_index=True, right_index=True)
    return df
  • tsdata.resample('2D').roll(lookback=5,horizon=2,id_sensitive=True).to_numpy() # ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 2 has 3 dimension(s)
  • tsdata.gen_global_feature(): same error as the resample call above
  • tsdata.gen_rolling_feature(window_size=30) IndexError: single positional indexer is out-of-bounds (appears when window_size is too large)
  • tsdata.roll(lookback=5,horizon=2,id_sensitive=True) numpy.AxisError: axis 2 is out of bounds for array of dimension 1.
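The out-of-bounds IndexError for a large window_size could be turned into a clearer message with an up-front length check. The sketch below uses only the id counts from not_aligned; `check_window_size` is a hypothetical helper, and the assumption that the failure starts exactly when window_size exceeds the shortest series is not confirmed in this thread.

```python
from collections import Counter

def check_window_size(ids, window_size):
    """Raise a readable error if any series is shorter than window_size."""
    lengths = Counter(ids)  # number of rows per id
    too_short = {k: n for k, n in lengths.items() if n < window_size}
    if too_short:
        raise ValueError(
            f"window_size={window_size} exceeds the length of series {too_short}"
        )

# Same id layout as not_aligned(): 20, 30 and 50 rows.
ids = ["00"] * 20 + ["01"] * 30 + ["02"] * 50
check_window_size(ids, window_size=10)  # passes silently
try:
    check_window_size(ids, window_size=30)
except ValueError as err:
    message = str(err)
    print(message)
```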

liangs6212 avatar Jul 02 '21 07:07 liangs6212

Calling scale(scaler, fit=False) multiple times should behave like calling it only once, since scale is effective only once when fit=True.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# TSDataset comes from the Chronos data module (import path varies by release)

df = pd.DataFrame({"datetime": np.array(['1/1/2019', '1/2/2019']),
                   "value": np.array([1, 2])})
df_test = pd.DataFrame({"datetime": np.array(['1/3/2019', '1/4/2019']),
                        "value": np.array([1, 2])})
tsdata = TSDataset.from_pandas(df,
                               dt_col="datetime",
                               target_col="value")
tsdata_test = TSDataset.from_pandas(df_test,
                                    dt_col="datetime",
                                    target_col="value")
standard_scaler = StandardScaler()
tsdata.scale(standard_scaler, fit=True)
# Scaling the test set twice with the already-fitted scaler
tsdata_test.scale(standard_scaler, fit=False).scale(standard_scaler, fit=False)
print(tsdata_test.df)

The expected value column is [-1, 1]; the current output is [-5, -1].
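The two results can be reproduced by hand without TSDataset. This sketch applies the standardization formula once and twice, using the population standard deviation as StandardScaler does; `transform` is just a local helper for illustration.

```python
from statistics import mean, pstdev

train = [1.0, 2.0]
test = [1.0, 2.0]

# Fit once on the training values: mean 1.5, population std 0.5.
mu, sigma = mean(train), pstdev(train)

def transform(values):
    """Standardize with the already-fitted mean and std."""
    return [(v - mu) / sigma for v in values]

once = transform(test)    # what a single scale(fit=False) should give
twice = transform(once)   # what happens when the transform is applied again
print(once)   # [-1.0, 1.0]
print(twice)  # [-5.0, -1.0]
```

The second pass treats the already-scaled values as raw input, which is why the reported output is [-5, -1].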

cabuliwallah avatar Jul 07 '21 09:07 cabuliwallah

Calling TSDataset methods in random order produces the following three types of errors (using get_multip_df):

  • gen_global_feature().gen_rolling_feature() / gen_global_feature().gen_global_feature()
    • Dict keys are not allowed to contain '__': extra feature__variance_larger_than_standard_deviation
  • gen_dt_feature().gen_global_feature()
    • numpy boolean subtract, the - operator, is not supported, use the bitwise_xor, the ^ operator, or the logical_xor function instead.
  • scale(fit=False)
    • not fitted.
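For the first error, one way to avoid feeding generated features back into a second feature-generation call is to separate them out by name. This is only a sketch: it assumes the error arises because columns produced by the first call (which contain '__') are passed in again, a cause inferred from the message and not confirmed in this thread. `split_generated_columns` is a hypothetical helper.

```python
def split_generated_columns(columns, separator="__"):
    """Separate raw columns from ones that already look like generated features."""
    raw = [c for c in columns if separator not in c]
    generated = [c for c in columns if separator in c]
    return raw, generated

columns = ["value", "extra feature",
           "extra feature__variance_larger_than_standard_deviation"]
raw, generated = split_generated_columns(columns)
print(raw)
print(generated)
```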

liangs6212 avatar Jul 08 '21 05:07 liangs6212

In utils/feature.py, function _is_weekend(): the line return (weekday >= 5).values should be changed to return (weekday >= 5).astype(int).values
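The effect of the proposed cast can be seen with NumPy alone. The sketch below uses made-up weekday values and is not the actual _is_weekend() code: subtracting boolean arrays raises a TypeError, matching the "numpy boolean subtract" message reported earlier in this thread, while the integer version subtracts normally.

```python
import numpy as np

weekday = np.array([4, 5, 6, 0])  # Monday=0 ... Sunday=6

is_weekend_bool = weekday >= 5    # boolean array
try:
    is_weekend_bool[1:] - is_weekend_bool[:-1]
    subtract_ok = True
except TypeError:
    subtract_ok = False           # NumPy refuses '-' on boolean arrays

# After casting to int, arithmetic on the feature works as expected.
is_weekend_int = (weekday >= 5).astype(int)
diff = is_weekend_int[1:] - is_weekend_int[:-1]
print(subtract_ok, diff.tolist())
```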

cabuliwallah avatar Jul 08 '21 08:07 cabuliwallah