Can we use a GPU when running the fin_model demo?
When I run "rdagent fin_model", it trains a GRU on my CPU without problems. How can I use a GPU device such as "cuda:0" to run this demo? Part of the terminal output from running this script is shown below:
[1:MainThread](2024-10-21 03:13:05,144) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:74] - GeneralPTNN pytorch version...
[1:MainThread](2024-10-21 03:13:05,157) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:92] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cpu
n_jobs : 20
use_GPU : False
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20, 'num_timesteps': 20}
[1:MainThread](2024-10-21 03:13:05,158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
EnhancedDeepGRUModel(
  (gru): GRU(20, 256, num_layers=5, batch_first=True, dropout=0.4)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)
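The `device : cpu` and `use_GPU : False` entries in this log indicate that the model wrapper resolved its device to the CPU. A minimal sketch of the usual PyTorch selection pattern follows. `pick_device` and its parameters are hypothetical names used for illustration; they are not taken from Qlib's source, so the actual logic in `pytorch_general_nn.py` may differ.

```python
def pick_device(gpu_index: int, cuda_available: bool) -> str:
    """Return a torch-style device string.

    Hypothetical helper: falls back to "cpu" when CUDA is unavailable
    or when a negative GPU index is requested.
    """
    if cuda_available and gpu_index >= 0:
        return f"cuda:{gpu_index}"
    return "cpu"


if __name__ == "__main__":
    # In a real run, cuda_available would come from torch.cuda.is_available().
    print(pick_device(0, cuda_available=False))
    print(pick_device(0, cuda_available=True))
```

Under this pattern, a CPU device in the log would mean either that CUDA was not visible to the process or that a negative GPU index was configured.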
Hi,
You could first check whether the base image chosen in your Dockerfile supports GPU functionality.
The Dockerfile can be found at rdagent/scenarios/qlib/docker.
I think I have the right Dockerfile; its contents are listed below.
`
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
# For GPU support, please choose the proper tag from https://hub.docker.com/r/pytorch/pytorch/tags
RUN apt-get clean && apt-get update && apt-get install -y \
    curl \
    vim \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/microsoft/qlib.git
WORKDIR /workspace/qlib
RUN git reset c9ed050ef034fe6519c14b59f3d207abcb693282 --hard
RUN python -m pip install --upgrade cython -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN python -m pip install -e . -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install catboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install xgboost -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
RUN pip install scipy==1.11.4 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
`
I also successfully generated a Docker image called "local_qlib". If I run this image with "docker run --rm -ti --gpus all local_qlib /bin/bash", running "nvidia-smi" inside the container gives normal output.
`
(rdagent) youme@youme-System-Product-Name:~/Documents/PythonProjects/RD-Agent$ docker run --rm -ti --gpus all local_qlib /bin/bash
root@8fa2d3b4c6eb:/workspace/qlib# nvidia-smi
Mon Oct 21 12:30:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off | 00000000:0A:00.0  On |                  N/A |
| 44%   55C    P2             111W / 350W |   2724MiB / 12288MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@8fa2d3b4c6eb:/workspace/qlib# ^C
root@8fa2d3b4c6eb:/workspace/qlib# exit
`
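A working `nvidia-smi` inside the container shows that the driver is passed through, but it does not by itself show what PyTorch can see. A small script like the following, run inside the same container, could confirm that. `cuda_report` is a hypothetical helper name, and the import is guarded so the script also runs where PyTorch is not installed.

```python
def cuda_report() -> dict:
    """Summarize what PyTorch can see (hypothetical diagnostic helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so nothing CUDA-related can be reported.
        return {
            "torch_installed": False,
            "cuda_available": False,
            "device_count": 0,
            "device_names": [],
        }

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    names = [torch.cuda.get_device_name(i) for i in range(count)]
    return {
        "torch_installed": True,
        "cuda_available": available,
        "device_count": count,
        "device_names": names,
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `device_count` is lower than expected, the container was probably started with fewer GPUs exposed than the training configuration assumes.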
However, when I run "rdagent fin_model", I get the error shown below.
[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:129] - model:
DeepGRUModel(
  (gru): GRU(20, 128, num_layers=3, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)
[1:MainThread](2024-10-21 12:20:21,034) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model size: 0.2440 MB
[1:MainThread](2024-10-21 12:20:21,520) INFO - qlib.timer - [log.py:127] - Time cost: 0.000s | waiting `async_log` Done
[1:MainThread](2024-10-21 12:20:21,522) ERROR - qlib.workflow - [utils.py:41] - An exception has been raised[RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
].
  File "/opt/conda/bin/qrun", line 8, in <module>
    sys.exit(run())
  File "/workspace/qlib/qlib/workflow/cli.py", line 151, in run
    fire.Fire(workflow)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/qlib/qlib/workflow/cli.py", line 145, in workflow
    recorder = task_train(config.get("task"), experiment_name=experiment_name)
  File "/workspace/qlib/qlib/model/trainer.py", line 127, in task_train
    _exe_task(task_config)
  File "/workspace/qlib/qlib/model/trainer.py", line 45, in _exe_task
    model: Model = init_instance_by_config(task_config["model"], accept_types=Model)
  File "/workspace/qlib/qlib/utils/mod.py", line 180, in init_instance_by_config
    return klass(**cls_kwargs, **try_kwargs, **kwargs)
  File "/workspace/qlib/qlib/contrib/model/pytorch_general_nn.py", line 140, in __init__
    self.dnn_model.to(self.device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 216, in _apply
    ret = super()._apply(fn, recurse)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Besides, the Docker container seems to detect the GPU device correctly; the relevant log line is shown below.
2024-10-21 20:20:18.348 | INFO | rdagent.utils.env:_gpu_kwargs:269 - GPU Devices are available.
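The `invalid device ordinal` error in the traceback above generally means that the requested CUDA index is not below the number of devices visible to the process, for example `cuda:1` when only one GPU is exposed to the container. The sketch below illustrates that check. `check_ordinal` is a hypothetical helper, and `visible_count` is assumed to come from `torch.cuda.device_count()`; which index the failing run actually requested is not shown in the logs.

```python
def check_ordinal(device: str, visible_count: int) -> bool:
    """Return True if `device` is usable with `visible_count` CUDA devices.

    Accepts "cpu", "cuda", or "cuda:N"; hypothetical helper for illustration.
    """
    if device == "cpu":
        return True
    kind, _, index = device.partition(":")
    if kind != "cuda":
        raise ValueError(f"unsupported device string: {device!r}")
    # A bare "cuda" is treated here as index 0, the default current device.
    ordinal = int(index) if index else 0
    return 0 <= ordinal < visible_count


if __name__ == "__main__":
    # With a single visible GPU, only index 0 is valid.
    print(check_ordinal("cuda:0", 1))
    print(check_ordinal("cuda:1", 1))
```

Comparing the configured device index with the device count reported inside the container is a quick way to tell whether the two disagree.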