DouZero

Why is my training FPS so low even though I also use 4 GPUs? It mostly stays around 2000

Open mfxiaosheng opened this issue 1 year ago • 7 comments

[INFO:1052 dmc:233 2022-07-20 17:39:38,765] After 1632000 (L:556800 U:528000 D:547200) frames: @ 1918.7 fps (avg@ 2318.1 fps) (L:0.0 U:0.0 D:1918.7) Stats: {'loss_landlord': 1.9155352115631104, 'loss_landlord_down': 2.5349276065826416, 'loss_landlord_up': 2.1095376014709473, 'mean_episode_return_landlord': 0.08421196788549423, 'mean_episode_return_landlord_down': -0.08074238896369934, 'mean_episode_return_landlord_up': -0.06534682214260101}
[INFO:1052 dmc:233 2022-07-20 17:39:43,769] After 1648000 (L:563200 U:537600 D:547200) frames: @ 3197.8 fps (avg@ 2398.1 fps) (L:1279.1 U:1918.7 D:0.0) Stats: {'loss_landlord': 2.3213179111480713, 'loss_landlord_down': 2.5349276065826416, 'loss_landlord_up': 2.6052844524383545, 'mean_episode_return_landlord': 0.09171878546476364, 'mean_episode_return_landlord_down': -0.08074238896369934, 'mean_episode_return_landlord_up': -0.08009536564350128}
[INFO:1052 dmc:233 2022-07-20 17:39:48,773] After 1654400 (L:569600 U:537600 D:547200) frames: @ 1279.1 fps (avg@ 2398.1 fps) (L:1279.1 U:0.0 D:0.0) Stats: {'loss_landlord': 2.185067892074585, 'loss_landlord_down': 2.5349276065826416, 'loss_landlord_up': 2.6052844524383545, 'mean_episode_return_landlord': 0.09759927541017532, 'mean_episode_return_landlord_down': -0.08074238896369934, 'mean_episode_return_landlord_up': -0.08009536564350128}
[INFO:1052 dmc:233 2022-07-20 17:39:53,779] After 1673600 (L:576000 U:540800 D:556800) frames: @ 3836.1 fps (avg@ 2344.8 fps) (L:1278.7 U:639.4 D:1918.1) Stats: {'loss_landlord': 1.77787184715271, 'loss_landlord_down': 2.7444241046905518, 'loss_landlord_up': 2.508575677871704, 'mean_episode_return_landlord': 0.10005713254213333, 'mean_episode_return_landlord_down': -0.09260766953229904, 'mean_episode_return_landlord_up': -0.08521360903978348}
[INFO:1052 dmc:233 2022-07-20 17:39:58,781] After 1680000 (L:576000 U:547200 D:556800) frames: @ 1279.5 fps (avg@ 2398.1 fps) (L:0.0 U:1279.5 D:0.0) Stats: {'loss_landlord': 1.77787184715271, 'loss_landlord_down': 2.7444241046905518, 'loss_landlord_up': 2.264894723892212, 'mean_episode_return_landlord': 0.10005713254213333, 'mean_episode_return_landlord_down': -0.09260766953229904, 'mean_episode_return_landlord_up': -0.08965221047401428}

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0    96W / 400W |  66690MiB / 81251MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:45:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  66704MiB / 81251MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   34C    P0    95W / 400W |  66700MiB / 81251MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:84:00.0 Off |                    0 |
| N/A   39C    P0    66W / 400W |   2653MiB / 81251MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The FPS stays low the whole time. It occasionally drops to 0 and occasionally jumps to 5000. Is this a normal training speed?

mfxiaosheng Jul 20 '22 09:07
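The jumps between 0 and several thousand FPS can be checked against the frame counters in the log itself. The sketch below re-derives the per-interval FPS from the timestamps and the `After N (L: U: D:)` counters of the five log lines above (the `Stats` dictionaries are omitted), and computes the greatest common divisor of the per-position increments. The function names `parse` and `interval_stats` are made up for this illustration and are not part of DouZero.

```python
import re
from datetime import datetime
from functools import reduce
from math import gcd

# Header fields copied from the five log lines in the post (Stats dicts omitted).
LOG = """\
[INFO:1052 dmc:233 2022-07-20 17:39:38,765] After 1632000 (L:556800 U:528000 D:547200) frames: @ 1918.7 fps
[INFO:1052 dmc:233 2022-07-20 17:39:43,769] After 1648000 (L:563200 U:537600 D:547200) frames: @ 3197.8 fps
[INFO:1052 dmc:233 2022-07-20 17:39:48,773] After 1654400 (L:569600 U:537600 D:547200) frames: @ 1279.1 fps
[INFO:1052 dmc:233 2022-07-20 17:39:53,779] After 1673600 (L:576000 U:540800 D:556800) frames: @ 3836.1 fps
[INFO:1052 dmc:233 2022-07-20 17:39:58,781] After 1680000 (L:576000 U:547200 D:556800) frames: @ 1279.5 fps
"""

PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] After (\d+) "
    r"\(L:(\d+) U:(\d+) D:(\d+)\)"
)


def parse(log):
    """Return a list of (timestamp, total_frames, per_position_frames)."""
    rows = []
    for m in PATTERN.finditer(log):
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        total, l, u, d = (int(g) for g in m.groups()[1:])
        rows.append((ts, total, {"L": l, "U": u, "D": d}))
    return rows


def interval_stats(rows):
    """Compute frame deltas and FPS between consecutive log lines."""
    out = []
    for (t0, n0, p0), (t1, n1, p1) in zip(rows, rows[1:]):
        seconds = (t1 - t0).total_seconds()
        out.append({
            "seconds": seconds,
            "frames": n1 - n0,
            "fps": (n1 - n0) / seconds,
            "per_position": {k: p1[k] - p0[k] for k in p0},
        })
    return out


stats = interval_stats(parse(LOG))

# Smallest step in which any per-position counter advances.
step = reduce(gcd, (v for s in stats for v in s["per_position"].values()))

for s in stats:
    print(f"{s['seconds']:.3f}s  +{s['frames']} frames  "
          f"{s['fps']:.1f} fps  {s['per_position']}")
print("counter step:", step)
```

In these five lines every per-position counter advances in multiples of 3200 frames, and the log is written roughly every 5 seconds. A single step therefore shows up as about 640 FPS, and a position whose counter does not advance in a given window is reported as 0.0. This would be consistent with frames being counted one learner batch at a time (for example a batch of 32 rollouts of 100 steps, which is an assumption and not confirmed in this thread). It only accounts for the jitter in the instantaneous readings. It does not say whether an average of about 2300 to 2400 FPS is the expected throughput on this hardware.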

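When you trained on multiple GPUs, did you run into this problem? Every spawned actor process also takes up the same amount of memory on GPU 0, so after a few actors have started, GPU 0 runs out of memory and a CUDA error is raised.

1978mountain Jul Aug 03 '22 09:08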
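One general way to keep a worker process from allocating memory on GPU 0 is to limit which GPUs it can see through the `CUDA_VISIBLE_DEVICES` environment variable, set before the process initializes CUDA. The sketch below only builds such an environment with a round-robin assignment. The helper `actor_env` and the actor count are hypothetical, and whether this is compatible with how DouZero starts its actors is not covered in this thread.

```python
import os


def actor_env(actor_index, gpu_ids, base_env=None):
    """Return a copy of the environment that exposes exactly one GPU.

    Hypothetical helper for illustration: actors are assigned to the
    given GPU ids in round-robin order.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[actor_index % len(gpu_ids)])
    return env


# Example: six hypothetical actors spread over GPUs 0, 1 and 2.
assignments = [
    actor_env(i, [0, 1, 2], base_env={})["CUDA_VISIBLE_DEVICES"]
    for i in range(6)
]
print(assignments)  # ['0', '1', '2', '0', '1', '2']
```

The returned dictionary can be passed as `env=` to `subprocess.Popen`, or the variable can be set in `os.environ` at the very start of a child process before any CUDA library is loaded. Inside such a process the single visible GPU is addressed as device 0, whatever its physical index.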

Why do I get only a little over 600 FPS on an A100 rented on Alibaba Cloud? What command-line arguments did you use? Could you share them?

zgz682000 Apr 04 '23 09:04

@zgz682000 Have you tried other GPU models?

daochenzha Apr 08 '23 04:04

@zgz682000 Have you tried other GPU models?

Yes. The graphics card in my own PC is a 1060, and it gets over 1000 FPS.

zgz682000 Apr 10 '23 02:04

@zgz682000 I don't know why that happens either. You could try a different GPU.

daochenzha Apr 10 '23 20:04

@mfxiaosheng Hello, I ran into the same problem as you. Did you manage to solve it later? Did the training speed ever improve afterwards?

Cyclones-Y Oct 17 '23 02:10

I have run into the same problem too. Could anyone experienced help?

aishxi Nov 29 '23 04:11