
Model train time fix?

Open · inevity opened this issue 6 months ago · 1 comment

πŸ› Bug Description

rdagent fin_mode's model training time exceeds 3600 seconds, so the process is killed. I see Startup Commands: /bin/sh -c timeout --kill-after=10 3600 qrun conf.yaml; entry_exit_code=$?; chmod -R 777 /workspace/qlib_workspace/; exit $entry_exit_code in the rdagent collect_info output. How can I reduce the training time itself, rather than only increasing the kill timeout? Also, if training is killed anyway, can the error be returned to rdagent instead of the training process being killed in a way that makes rdagent exit? Normally, how long should training a model take? How do I get rdagent to avoid developing another model that runs like this? I use rocm/pytorch; should I also change the default model config?

To Reproduce

Steps to reproduce the behavior:

[8:MainThread]([DATETIME],005) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:75] - GeneralPTNN pytorch version...
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:93] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cuda:0
n_jobs : 20
use_GPU : True
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20}
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model:
GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm(
(gru1): GRU(20, 128, batch_first=True)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(gru2): GRU(128, 128, batch_first=True)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(gru3): GRU(128, 128, batch_first=True)
(bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(dropout): Dropout(p=0.3, inplace=False)
(fc): Linear(in_features=128, out_features=1, bias=True)
)
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:131] - model size: 0.2448 MB
[8:MainThread]([DATETIME],289) INFO - qlib.timer - [log.py:127] - Time cost: 5.926s | Loading data Done
[8:MainThread]([DATETIME],635) INFO - qlib.timer - [log.py:127] - Time cost: 0.009s | FilterCol Done
[8:MainThread]([DATETIME],933) INFO - qlib.timer - [log.py:127] - Time cost: 0.297s | RobustZScoreNorm Done
[8:MainThread]([DATETIME],980) INFO - qlib.timer - [log.py:127] - Time cost: 0.047s | Fillna Done
[8:MainThread]([DATETIME],016) INFO - qlib.timer - [log.py:127] - Time cost: 0.012s | DropnaLabel Done
[8:MainThread]([DATETIME],120) INFO - qlib.timer - [log.py:127] - Time cost: 0.104s | CSRankNorm Done
[8:MainThread]([DATETIME],120) INFO - qlib.timer - [log.py:127] - Time cost: 0.831s | fit & process data Done
[8:MainThread]([DATETIME],121) INFO - qlib.timer - [log.py:127] - Time cost: 6.758s | Init data Done
[8:MainThread]([DATETIME],150) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:246] - Train samples: 478007
[8:MainThread]([DATETIME],150) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:247] - Valid samples: 128309
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:295] - training...
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch0:
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],161) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:302] - evaluating...
[8:MainThread]([DATETIME],041) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:305] - Epoch0: train 0.998860, valid 1.000189
[8:MainThread]([DATETIME],042) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch1:
[8:MainThread]([DATETIME],043) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],652) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch18:
[8:MainThread]([DATETIME],652) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],479) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:302] - evaluating...
───────────────────────────────────────────────────────── Docker Logs End 
2025-05-30 22:13:53.184 | INFO     | rdagent.utils.env:__run_ret_code_with_retry:167 - Running time: 3600.463972091675 seconds
2025-05-30 22:13:53.186 | WARNING  | rdagent.utils.env:__run_ret_code_with_retry:169 - The running time exceeds 3600 seconds, so the process is killed.

2. Model code


import torch
import torch.nn as nn

class GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm(nn.Module):
    def __init__(self, num_features):
        super(GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm, self).__init__()
        self.num_features = num_features
        self.gru1 = nn.GRU(num_features, 128, batch_first=True)
        self.bn1 = nn.BatchNorm1d(128)
        self.gru2 = nn.GRU(128, 128, batch_first=True)
        self.bn2 = nn.BatchNorm1d(128)
        self.gru3 = nn.GRU(128, 128, batch_first=True)
        self.bn3 = nn.BatchNorm1d(128)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(128, 1)

    def forward(self, x):
        # x shape: (batch_size, num_features)
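        # note: with a 2D input (no sequence dimension), nn.GRU with batch_first=True
        # treats x as a single unbatched sequence of length batch_size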
        out = self.gru1(x)[0]
        out = self.bn1(out)
        out = self.dropout(out)
        out = self.gru2(out)[0]
        out = self.bn2(out)
        out = self.dropout(out)
        out = self.gru3(out)[0]
        out = self.bn3(out)
        out = self.dropout(out)
        out = self.fc(out)
        return out

model_cls = GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm

#Execution feedback:---------------
#Execution successful, output tensor shape: (8, 1)

#--------------Model value feedback:---------------
#The shape of the output is correct.
#No ground truth output provided. Value evaluation is not practical.
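For reference, a smoke test along the lines below reproduces the shape check described in the feedback; the (8, 20) random input and the eval-mode call are my assumptions, not taken from the report, and the snippet assumes the model definition above has already been executed.

import torch

model = model_cls(num_features=20)
model.eval()  # identity dropout and BatchNorm running stats for a deterministic pass
with torch.no_grad():
    out = model(torch.randn(8, 20))  # 2D input: the GRU treats it as one sequence of length 8
print(out.shape)  # torch.Size([8, 1]), matching the execution feedback above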
  1. a: single AMD 6700 XT GPU, 12 GB VRAM

     b: from rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.2.1

     c: qlib master

     d: rd-agent master

Expected Behavior

Screenshot

Environment

Note: Users can run rdagent collect_info to get system information and paste it directly here.

  • Name of current operating system:
  • Processor architecture:
  • System, version, and hardware information:
  • Version number of the system:
  • Python version:
  • Container ID:
  • Container Name:
  • Container Status:
  • Image ID used by the container:
  • Image tag used by the container:
  • Container port mapping:
  • Container Label:
  • Startup Commands:
  • RD-Agent version:
  • Package version:

Additional Notes

inevity · May 30 '25 16:05

First, I located the timeout control parameter in the class FactorCoSTEERSettings, defined in the RD-Agent codebase under rdagent.steer.settings.factor_costeer. This class uses pydantic.BaseSettings along with a SettingsConfigDict that specifies the environment variable prefix FACTOR_COSTEER_.

Because of this design, all class fields (such as file_based_execution_timeout) can be overridden at runtime using environment variables that follow the naming pattern FACTOR_COSTEER_<UPPERCASE_FIELD_NAME>.

Specifically, the timeout value for factor implementation execution is defined as: file_based_execution_timeout: int = 3600

This means it defaults to 3600 seconds (1 hour). However, since this parameter is environment-configurable, we don’t need to modify the source code directly.
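As a minimal sketch of that mechanism (the class body below is reconstructed from the description above, not copied from the RD-Agent source), a pydantic-settings class with an env_prefix picks up such overrides automatically:

from pydantic_settings import BaseSettings, SettingsConfigDict

class FactorCoSTEERSettings(BaseSettings):
    # every field can be overridden via FACTOR_COSTEER_<UPPERCASE_FIELD_NAME>
    model_config = SettingsConfigDict(env_prefix="FACTOR_COSTEER_")
    file_based_execution_timeout: int = 3600  # default: 1 hour, as seen in the killed run

settings = FactorCoSTEERSettings()
print(settings.file_based_execution_timeout)  # 7200 once the variable below is exported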

Instead, before running rdagent fin_mode, we can simply set the environment variable:

export FACTOR_COSTEER_FILE_BASED_EXECUTION_TIMEOUT=7200

Or place the same setting in a .env file. This way, RD-Agent will automatically pick up the new timeout value (7200 seconds) at runtime. This approach is clean, maintainable, and aligned with the design philosophy of the RD-Agent system.
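For example, the equivalent one-line .env entry would look like this (I am assuming RD-Agent reads the .env from the directory where it is launched; the value is in seconds):

# .env
FACTOR_COSTEER_FILE_BASED_EXECUTION_TIMEOUT=7200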

GaryMMMM · Jun 07 '25 03:06