ElegantRL training on paper trading notebook doesn't show the model learning
Using the following ERL parameters:
ERL_PARAMS = {"learning_rate": 3e-6, "batch_size": 2048, "gamma": 0.985, "seed": 312, "net_dimension": [128, 64], "target_step": 50000, "eval_gap": 30, "eval_times": 5}
When I run training on a larger dataset, as seen below:
    train(start_date='2005-01-01',
          end_date='2022-12-31',
          ticker_list=ticker_list,
          data_source='alpaca',
          time_interval='1Min',
          technical_indicator_list=INDICATORS,
          drl_lib='elegantrl',
          env=env,
          model_name='ppo',
          if_vix=True,
          API_KEY=API_KEY,
          API_SECRET=API_SECRET,
          API_BASE_URL=API_BASE_URL,
          erl_params=ERL_PARAMS,
          cwd='./papertrading_erl_orig',  # current_working_dir
          break_step=1e7)
My output for the training is:
| `step`: Number of samples, or total training steps, or running times of `env.step()`.
| `time`: Time spent from the start of training to this moment.
| `avgR`: Average value of cumulative rewards, which is the sum of rewards in an episode.
| `stdR`: Standard dev of cumulative rewards, which is the sum of rewards in an episode.
| `avgS`: Average of steps in an episode.
| `objC`: Objective of Critic network. Or call it loss function of critic network.
| `objA`: Objective of Actor network. It is the average Q value of the critic network.
| step time | avgR stdR avgS | objC objA
| 2.00e+04 11 | -0.49 0.02 12345 | 0.05 0.19
| 4.00e+04 22 | -0.49 0.02 12345 | 0.00 0.18
| 6.00e+04 33 | -0.49 0.02 12345 | 0.00 0.19
| 8.00e+04 44 | -0.49 0.01 12345 | 0.00 0.19
| 1.00e+05 55 | -0.48 0.03 12345 | 0.00 0.18
| 1.20e+05 66 | -0.49 0.03 12345 | 0.00 0.17
| 1.40e+05 77 | -0.48 0.02 12345 | 0.00 0.18
| 1.60e+05 88 | -0.50 0.02 12345 | 0.00 0.19
| 1.80e+05 99 | -0.48 0.02 12345 | 0.00 0.18
| 2.00e+05 111 | -0.48 0.03 12345 | 0.00 0.18
| 2.20e+05 122 | -0.48 0.01 12345 | 0.00 0.19
| 2.40e+05 133 | -0.49 0.02 12345 | 0.00 0.18
| 2.60e+05 144 | -0.48 0.03 12345 | 0.00 0.19
| 2.80e+05 155 | -0.49 0.01 12345 | 0.00 0.19
| 3.00e+05 166 | -0.48 0.02 12345 | 0.00 0.19
This output continues even after the training has run for hours. Shouldn't the avgR and objA values increase slowly over time?
Is this output normal? I have tweaked the ERL params, changing the batch size, learning rate, and other settings, but I always get the same results. If I change the dataset to a smaller interval, for example 2021-01-01 to 2022-12-31, the avgR increases to over 50 points, but again it stays pretty constant.
When I run the test on unseen data I get mixed results. With SB3 I can tell from explained_variance whether the model is learning; in this case I have no clue. Is this a bug or normal behavior?
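One way to answer the "is it learning?" question without eyeballing the table is to parse the log and compare the overall drift in avgR with the typical stdR. The snippet below is only a rough sketch: `parse_rows`, `looks_flat`, and the "drift smaller than the mean stdR" rule are my own illustrative choices, not part of ElegantRL or FinRL, and the sample log is just the first few rows from above.

```python
import re

# Matches rows shaped like "| 2.00e+04 11 | -0.49 0.02 12345 | 0.05 0.19".
ROW = re.compile(r"^\|\s*(\S+)\s+(\S+)\s*\|\s*(\S+)\s+(\S+)\s+(\S+)\s*\|")


def parse_rows(log_text):
    """Return (step, avgR, stdR) tuples for every numeric row in the log."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # legend lines and free text
        step, _time, avg_r, std_r, _avg_s = match.groups()
        try:
            rows.append((float(step), float(avg_r), float(std_r)))
        except ValueError:
            continue  # the "| step time | avgR stdR avgS |" header row
    return rows


def looks_flat(rows):
    """Heuristic (an assumption, not an ElegantRL metric): fit a least-squares
    line to avgR vs. step and call the run flat if the fitted change over the
    whole log is no larger than the mean stdR."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two log rows")
    mean_step = sum(r[0] for r in rows) / n
    mean_avg_r = sum(r[1] for r in rows) / n
    var_step = sum((r[0] - mean_step) ** 2 for r in rows)
    cov = sum((r[0] - mean_step) * (r[1] - mean_avg_r) for r in rows)
    slope = cov / var_step
    total_change = slope * (rows[-1][0] - rows[0][0])
    typical_std = sum(r[2] for r in rows) / n
    return abs(total_change) <= typical_std


SAMPLE_LOG = """
| step time | avgR stdR avgS | objC objA
| 2.00e+04 11 | -0.49 0.02 12345 | 0.05 0.19
| 4.00e+04 22 | -0.49 0.02 12345 | 0.00 0.18
| 6.00e+04 33 | -0.49 0.02 12345 | 0.00 0.19
| 8.00e+04 44 | -0.49 0.01 12345 | 0.00 0.19
| 1.00e+05 55 | -0.48 0.03 12345 | 0.00 0.18
"""

if __name__ == "__main__":
    rows = parse_rows(SAMPLE_LOG)
    print(len(rows), "rows parsed; flat:", looks_flat(rows))
```

On these rows the fitted drift in avgR is well inside the reported stdR, so the heuristic reports the run as flat. It says nothing about why the curve is flat; it only makes the observation reproducible.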
I ended up figuring out the issue. In the ElegantRL code, there is a section that sets the number of time steps; the value is hardcoded and ignores ERL_PARAMS. That limits the number of steps the model runs to 1234, regardless of what you put in ERL_PARAMS. You can either change the code to read the value from ERL_PARAMS or put in the value yourself. The code is at
```python
import torch


def get_rewards_and_steps(env, actor, if_render: bool = False) -> (float, int):  # cumulative_rewards and episode_steps
    device = next(actor.parameters()).device  # net.parameters() is a Python generator.

    state = env.reset()[0]  # reset() returns (state, info); keep only the state
    episode_steps = 0
    cumulative_returns = 0.0  # sum of rewards in an episode
    # totalTimesteps replaces the hardcoded limit; it must be defined in the enclosing scope.
    for episode_steps in range(totalTimesteps):
        tensor_state = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        tensor_action = actor(tensor_state)
        action = tensor_action.detach().cpu().numpy()[0]  # not need detach(), because using torch.no_grad() outside
        state, reward, done, extra, _ = env.step(action)
        cumulative_returns += reward

        if if_render:
            env.render()
        if done:
            break
    return cumulative_returns, episode_steps + 1
```
The line `for episode_steps in range(totalTimesteps):` originally had a hardcoded 1234 in place of `totalTimesteps`, which limits the model to that many steps; on a larger dataset that is not enough to yield good results. I hope this helps. Based on this, I created a new trainer, which allows multiple training scripts to run simultaneously and greatly reduces the data processing time when using Alpaca, including caching processed data. You can check it at
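To make the effect of the hardcoded limit concrete, here is a small self-contained sketch with no ElegantRL or torch dependency. `ToyEnv`, `rollout`, and `max_steps` are hypothetical names I made up for illustration, and the environment simply pays a reward of 1.0 per step. It shows that when the loop bound is shorter than the episode, the reported step count and return are set by the bound and not by the data.

```python
class ToyEnv:
    """Hypothetical environment: reward 1.0 per step, episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}  # (state, info)

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        # (state, reward, terminated, truncated, info)
        return float(self.t), 1.0, terminated, False, {}


def rollout(env, policy, max_steps):
    """Run one episode, stopping at `max_steps` even if the episode is not over."""
    state, _ = env.reset()
    cumulative_returns = 0.0
    steps = 0
    for steps in range(1, max_steps + 1):
        state, reward, terminated, truncated, _ = env.step(policy(state))
        cumulative_returns += reward
        if terminated or truncated:
            break
    return cumulative_returns, steps


if __name__ == "__main__":
    env = ToyEnv(length=50_000)

    def policy(state):
        return 0  # the action is ignored by ToyEnv

    print(rollout(env, policy, max_steps=1234))        # (1234.0, 1234)
    print(rollout(env, policy, max_steps=env.length))  # (50000.0, 50000)
```

With the fixed bound, every rollout reports the same number of steps no matter how long the dataset is, which is the same symptom as a constant avgS column. Passing the bound in as an argument (or reading it from the params dict) lets the rollout cover the full episode.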