ElegantRL training on paper trading notebook doesn't show the model learning
Using the following ERL parameters:
```python
ERL_PARAMS = {
    "learning_rate": 3e-6,
    "batch_size": 2048,
    "gamma": 0.985,
    "seed": 312,
    "net_dimension": [128, 64],
    "target_step": 50000,
    "eval_gap": 30,
    "eval_times": 5,
}
```
When I run training on a larger dataset, as seen below:
```python
train(
    start_date='2005-01-01',
    end_date='2022-12-31',
    ticker_list=ticker_list,
    data_source='alpaca',
    time_interval='1Min',
    technical_indicator_list=INDICATORS,
    drl_lib='elegantrl',
    env=env,
    model_name='ppo',
    if_vix=True,
    API_KEY=API_KEY,
    API_SECRET=API_SECRET,
    API_BASE_URL=API_BASE_URL,
    erl_params=ERL_PARAMS,
    cwd='./papertrading_erl_orig',  # current_working_dir
    break_step=1e7,
)
```
My output for the training is:
```
| `step`: Number of samples, or total training steps, or running times of `env.step()`.
| `time`: Time spent from the start of training to this moment.
| `avgR`: Average value of cumulative rewards, which is the sum of rewards in an episode.
| `stdR`: Standard dev of cumulative rewards, which is the sum of rewards in an episode.
| `avgS`: Average of steps in an episode.
| `objC`: Objective of Critic network. Or call it loss function of critic network.
| `objA`: Objective of Actor network. It is the average Q value of the critic network.
| step      time |   avgR   stdR   avgS |   objC   objA
| 2.00e+04    11 |  -0.49   0.02  12345 |   0.05   0.19
| 4.00e+04    22 |  -0.49   0.02  12345 |   0.00   0.18
| 6.00e+04    33 |  -0.49   0.02  12345 |   0.00   0.19
| 8.00e+04    44 |  -0.49   0.01  12345 |   0.00   0.19
| 1.00e+05    55 |  -0.48   0.03  12345 |   0.00   0.18
| 1.20e+05    66 |  -0.49   0.03  12345 |   0.00   0.17
| 1.40e+05    77 |  -0.48   0.02  12345 |   0.00   0.18
| 1.60e+05    88 |  -0.50   0.02  12345 |   0.00   0.19
| 1.80e+05    99 |  -0.48   0.02  12345 |   0.00   0.18
| 2.00e+05   111 |  -0.48   0.03  12345 |   0.00   0.18
| 2.20e+05   122 |  -0.48   0.01  12345 |   0.00   0.19
| 2.40e+05   133 |  -0.49   0.02  12345 |   0.00   0.18
| 2.60e+05   144 |  -0.48   0.03  12345 |   0.00   0.19
| 2.80e+05   155 |  -0.49   0.01  12345 |   0.00   0.19
| 3.00e+05   166 |  -0.48   0.02  12345 |   0.00   0.19
```
This output continues even after training has run for hours. Shouldn't the avgR and objA values slowly increase over time?
Is this output normal? I have tweaked the ERL params and changed the batch size, learning rate, and other settings, but I always get the same results. If I use a shorter date range, for example 2021-01-01 to 2022-12-31, avgR increases to over 50 points, but again it stays roughly constant.
When I run the test on unseen data I get mixed results. With SB3 I can tell from explained_variance whether the model is learning; here I have no such signal. Is this a bug or normal behavior?
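For context, the explained_variance that SB3 reports is 1 - Var(returns - value_predictions) / Var(returns): values near 1 mean the value network tracks the empirical returns, values near 0 or below mean it does not. A minimal stand-alone sketch of that calculation (my own helper, not something FinRL or ElegantRL exposes):

```python
import numpy as np

def explained_variance(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """1 - Var(y_true - y_pred) / Var(y_true); ~1 means the critic tracks the returns."""
    var_y = np.var(y_true)
    return float("nan") if var_y == 0 else float(1.0 - np.var(y_true - y_pred) / var_y)

# e.g. compare the critic's value predictions with the empirical discounted returns
# collected during a rollout:
# ev = explained_variance(value_predictions, empirical_returns)
```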
I ended up figuring out the issue. In the ElegantRL code there is a function that sets how many time steps an episode can run; the value is hardcoded and does not look at ERL_PARAMS. That limits the model to 12,345 steps per episode (you can see the cap in the constant avgS column above), independently of what you put in ERL_PARAMS. You can change it so the value is read from ERL_PARAMS, or set the value yourself. The code is the `get_rewards_and_steps` function:
```python
def get_rewards_and_steps(env, actor, if_render: bool = False) -> (float, int):  # cumulative_rewards and episode_steps
    device = next(actor.parameters()).device  # net.parameters() is a Python generator

    state = env.reset()[0]
    episode_steps = 0
    cumulative_returns = 0.0  # sum of rewards in an episode
    for episode_steps in range(12345):  # <-- hardcoded episode-step cap
        tensor_state = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        tensor_action = actor(tensor_state)
        action = tensor_action.detach().cpu().numpy()[0]  # detach() not strictly needed, torch.no_grad() is used outside
        state, reward, done, extra, _ = env.step(action)  # gymnasium-style step tuple
        cumulative_returns += reward

        if if_render:
            env.render()
        if done:
            break
    return cumulative_returns, episode_steps + 1
```
The line `for episode_steps in range(12345):` hardcodes that cap, limiting the model to 12,345 steps per episode, which on a larger dataset is insufficient to yield good results. You can replace the literal with a variable or a parameter fed from ERL_PARAMS; a minimal sketch follows.
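Here is one hypothetical way to wire the cap through instead of hardcoding it. The `max_episode_steps` key and the helper below are my own naming for illustration; FinRL does not read such a key out of the box:

```python
# None of these names (max_episode_steps, episode_step_cap) exist in FinRL or ElegantRL;
# they only illustrate "read the value from ERL_PARAMS or set it yourself".
ELEGANTRL_DEFAULT_CAP = 12345  # the value currently hardcoded in get_rewards_and_steps

def episode_step_cap(erl_params: dict) -> int:
    """Take the cap from ERL_PARAMS when present, otherwise keep ElegantRL's default."""
    return int(erl_params.get("max_episode_steps", ELEGANTRL_DEFAULT_CAP))

# In get_rewards_and_steps, change
#     for episode_steps in range(12345):
# to
#     for episode_steps in range(max_episode_steps):
# and pass max_episode_steps=episode_step_cap(ERL_PARAMS) from the code that calls it.
```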
I hope this helps. Based on this, I created a new trainer that allows multiple training scripts to run simultaneously and greatly improves data processing time when using Alpaca, including caching of processed data. You can check it out at https://github.com/mikazlopes/training-farm