Full dataset normalization biases test set

Open theely opened this issue 1 year ago • 1 comments

In FinRL_PortfolioOptimizationEnv_Demo.ipynb all data used for training and testing are normalised at once:

portfolio_norm_df = GroupByScaler(by="tic", scaler=MaxAbsScaler).fit_transform(portfolio_raw_df)

By doing so we tell the model when across all our observations (training + test) the stock price of a given position in the portfolio has been the highest.

@C4i0kun should we not instead normalize only the batch of observations for a given time step? Avoiding to introduce forward looking information, that is supposed to be unknown. For example, we could delegate the normalisation to env_portfolio_optimization.py:

# define data to be used in this time step
self._data_raw = self._df[
    (self._df[self._time_column] >= start_time)
    & (self._df[self._time_column] <= end_time)
][[self._time_column, self._tic_column] + self._features]

# normalise observations
self._data = GroupByScaler(by="tic", scaler=MaxAbsScaler, columns=['close','high','low']).fit_transform(self._data_raw)

Jun 01 '24 12:06 theely

Thank you for bringing up the issue. Currently, the FinRL library is extremely poorly maintained. Rest assured, I will reorganize a team to ensure its proper maintenance.

Best regards,

Bruce Yang

Jun 24 '24 13:06 BruceYanghy