DI-engine

Why do collector_env and evaluator_env have the same PID?

Open jaried opened this issue 2 years ago • 2 comments

  • [ ] I have marked all applicable categories:
    • [ ] exception-raising bug
    • [ ] RL algorithm bug
    • [ ] system worker bug
    • [ ] system utils bug
    • [ ] code design/refactor
    • [x] documentation request
    • [ ] new feature request
  • [ ] I have visited the readme and doc
  • [ ] I have searched through the issue tracker and pr tracker
  • [x] I have mentioned version numbers, operating system and environment, where applicable:
import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.12.1+cu116 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] win32

I created the environments with serial_pipeline. Why do collector_env and evaluator_env have the same PID? The code is the same as in my previous issue: https://github.com/jaried/study_di

2022-10-10 23:26:52 - PID: 14576
2022-10-10 23:27:14 - PID: 14576

jaried, Oct 10 '22 15:10

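Your config uses the base env manager; switch it to subprocess to get multiple processes. In general, base is a pseudo-multiprocess mode meant for debugging code. Once the code runs without problems, use subprocess for actual runs.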
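The difference described above can be illustrated with the standard library alone. This is a minimal sketch, not DI-engine's implementation: the helper names `pid_in_process` and `pid_in_child` are hypothetical, and the "collector" and "evaluator" labels are only stand-ins. It shows that any code running inside the main process reports the same PID, while code running in a real child process reports a different one.

```python
import os
import subprocess
import sys


def pid_in_process() -> int:
    """Return the PID seen by code that runs inside the current process."""
    return os.getpid()


def pid_in_child() -> int:
    """Return the PID seen by the same call when it runs in a child process."""
    out = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())


print("main process:", os.getpid())
# Two in-process "environments" (stand-ins only) share the main PID.
print("in-process collector stand-in:", pid_in_process())
print("in-process evaluator stand-in:", pid_in_process())
# Work done in a separate process reports its own PID.
print("child process:", pid_in_child())
```

One consequence is that a PID logged from the main script stays the same no matter which env manager is configured; the PID has to be logged from code that actually runs inside the worker. Whether that explains the logs in this issue is not confirmed in the thread.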

PaParaZz1, Oct 11 '22 06:10

I changed it to env_manager=dict(type='subprocess'), but it is still the same process:

2022-10-11 14:42:07 - PID: 16296
2022-10-11 14:44:56 - PID: 16296

And there is also an exception:


2022-10-11 14:46:57,001 - root - ERROR - Env 0 reset has exceeded max retries(5)
Traceback (most recent call last):
  File "D:\Anaconda3\lib\site-packages\ding\utils\system_helper.py", line 57, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 461, in _reset
    raise runtime_error
  File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 439, in _reset
    reset_fn()
  File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 425, in reset_fn
    raise ConnectionError("env reset connection timeout")  # Leave it to try again
RuntimeError: Env 0 reset has exceeded max retries(5), and the latest exception is: ConnectionError('env reset connection timeout')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Anaconda3\lib\site-packages\ding\utils\registry.py", line 96, in build
    raise e
  File "D:\Anaconda3\lib\site-packages\ding\utils\registry.py", line 82, in build
    return build_fn(*obj_args, **obj_kwargs)
  File "D:\Anaconda3\lib\site-packages\ding\worker\collector\sample_serial_collector.py", line 65, in __init__
    self.reset(policy, env)
  File "D:\Anaconda3\lib\site-packages\ding\worker\collector\sample_serial_collector.py", line 130, in reset
    self.reset_env(_env)
  File "D:\Anaconda3\lib\site-packages\ding\worker\collector\sample_serial_collector.py", line 80, in reset_env
    self._env.launch()
  File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 352, in launch
    self.reset(reset_param)
  File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 409, in reset
    t.join()
  File "D:\Anaconda3\lib\site-packages\ding\utils\system_helper.py", line 64, in join
    raise RuntimeError('Exception in thread({})'.format(id(self))) from self.exc
RuntimeError: Exception in thread(2335739317600)
python-BaseException
Backend Qt5Agg is interactive backend. Turning interactive mode on.

Process finished with exit code 1

jaried, Oct 11 '22 06:10

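I ran the code you provided on my own machine (with one small change that does not affect the conclusion). With subprocess it runs, and the PIDs are different, as shown in the screenshot below: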

Screen Shot 2022-10-13 at 12 48 39 PM

My environment is:

import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.8.0 3.8.5 (default, Sep  4 2020, 02:22:02) 
[Clang 10.0.0 ] darwin

So this looks like it may be a Windows-related problem. Is this the only system environment you have?
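When comparing runs across operating systems, it can help to record how Python's multiprocessing module starts child processes on each machine, next to the version line already quoted in this thread. The sketch below is only a diagnostic aid under stated assumptions: `describe_runtime` is a hypothetical helper, and whether DI-engine's subprocess env manager follows this default is not established here.

```python
import multiprocessing
import sys


def describe_runtime() -> dict:
    """Collect platform details that affect how child processes are started."""
    return {
        "platform": sys.platform,
        "python": sys.version.split()[0],
        # Default start method of the multiprocessing module on this machine.
        "start_method": multiprocessing.get_start_method(),
    }


print(describe_runtime())
```

On Windows the default start method is spawn. On Linux, with the Python versions quoted in this thread, it is fork.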

PaParaZz1, Oct 13 '22 04:10

CUDA on my Ubuntu 22.04 machine has some problems, so I ran subprocess in CPU mode. It still shows the same PID and also reports the error.

import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.12.1+cu116 3.9.13 (main, Aug 25 2022, 23:26:10) 
[GCC 11.2.0] linux

image

jaried, Oct 13 '22 07:10

I ran it on a CentOS server and it works fine there as well. Screenshot of the run:

Screen Shot 2022-10-19 at 6 07 41 PM

The relevant library versions are:
>>> import ding, torch, sys
>>> print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.10.0+cu102 3.6.8 (default, Nov 16 2020, 16:55:22) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] linux

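Could the code being run differ somewhere? I have put the code from your repository that I actually ran here. You can try running it directly with python3 -u option_masac_config.py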

PaParaZz1, Oct 19 '22 10:10

I created a new project with your code and ran it. It works correctly on Windows too.

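However, comparing the two versions of the code, I could not find any difference. I will keep looking for the cause myself. Thank you for your answers.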

jaried, Oct 19 '22 10:10