DI-engine
Why do collector_env and evaluator_env have the same PID?
- [ ] I have marked all applicable categories:
- [ ] exception-raising bug
- [ ] RL algorithm bug
- [ ] system worker bug
- [ ] system utils bug
- [ ] code design/refactor
- [x] documentation request
- [ ] new feature request
- [ ] I have visited the readme and doc
- [ ] I have searched through the issue tracker and pr tracker
- [x] I have mentioned version numbers, operating system and environment, where applicable:
import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.12.1+cu116 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] win32
I create the environments with serial_pipeline. Why do collector_env and evaluator_env have the same PID? The code is the same as in my previous issue: https://github.com/jaried/study_di
2022-10-10 23:26:52 - PID: 14576
2022-10-10 23:27:14 - PID: 14576
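Your config uses the base env manager; switch it to subprocess and you will get multiple processes. Generally speaking, base is a pseudo-multiprocess manager meant for debugging code; once everything works, use subprocess for the actual runs.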
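To check whether code really runs in a separate process, it helps to compare os.getpid() in the launching script with the value reported from inside the worker. The following is a minimal sketch using only the Python standard library; it is not DI-engine's env manager implementation, and pids_in_process and pids_in_children are hypothetical helper names chosen for illustration.

```python
import os
import subprocess
import sys


def pids_in_process(n):
    """Query the PID n times from the current process.

    This mimics an in-process ("pseudo multi-process") setup:
    every call runs in the same interpreter, so every PID is identical.
    """
    return [os.getpid() for _ in range(n)]


def pids_in_children(n):
    """Start n fresh Python interpreters and ask each one for its own PID."""
    pids = []
    for _ in range(n):
        result = subprocess.run(
            [sys.executable, "-c", "import os; print(os.getpid())"],
            capture_output=True,
            text=True,
            check=True,
        )
        pids.append(int(result.stdout.strip()))
    return pids


if __name__ == "__main__":
    print("parent PID:     ", os.getpid())
    print("in-process PIDs:", pids_in_process(2))
    print("child PIDs:     ", pids_in_children(2))
```

If a PID logged for an environment equals the parent PID, the logging call was executed in the main process, whichever manager type is configured.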
I changed it to env_manager=dict(type='subprocess'), but it is still the same process:
2022-10-11 14:42:07 - PID: 16296
2022-10-11 14:44:56 - PID: 16296
There is also an exception:
2022-10-11 14:46:57,001 - root - ERROR - Env 0 reset has exceeded max retries(5)
Traceback (most recent call last):
File "D:\Anaconda3\lib\site-packages\ding\utils\system_helper.py", line 57, in run
self.ret = self._target(*self._args, **self._kwargs)
File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 461, in _reset
raise runtime_error
File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 439, in _reset
reset_fn()
File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 425, in reset_fn
raise ConnectionError("env reset connection timeout") # Leave it to try again
RuntimeError: Env 0 reset has exceeded max retries(5), and the latest exception is: ConnectionError('env reset connection timeout')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\Anaconda3\lib\site-packages\ding\utils\registry.py", line 96, in build
raise e
File "D:\Anaconda3\lib\site-packages\ding\utils\registry.py", line 82, in build
return build_fn(*obj_args, **obj_kwargs)
File "D:\Anaconda3\lib\site-packages\ding\worker\collector\sample_serial_collector.py", line 65, in __init__
self.reset(policy, env)
File "D:\Anaconda3\lib\site-packages\ding\worker\collector\sample_serial_collector.py", line 130, in reset
self.reset_env(_env)
File "D:\Anaconda3\lib\site-packages\ding\worker\collector\sample_serial_collector.py", line 80, in reset_env
self._env.launch()
File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 352, in launch
self.reset(reset_param)
File "D:\Anaconda3\lib\site-packages\ding\envs\env_manager\subprocess_env_manager.py", line 409, in reset
t.join()
File "D:\Anaconda3\lib\site-packages\ding\utils\system_helper.py", line 64, in join
raise RuntimeError('Exception in thread({})'.format(id(self))) from self.exc
RuntimeError: Exception in thread(2335739317600)
python-BaseException
Backend Qt5Agg is interactive backend. Turning interactive mode on.
Process finished with exit code 1
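The traceback above follows a retry-then-wrap pattern: each reset attempt times out with a ConnectionError, and after the maximum number of retries the last error is re-raised wrapped in a RuntimeError. The sketch below illustrates that general pattern only; it is not the code in subprocess_env_manager.py, and reset_with_retries is a hypothetical name.

```python
def reset_with_retries(reset_fn, max_retries=5):
    """Call reset_fn until it succeeds or max_retries attempts have failed.

    Hypothetical helper for illustration; only ConnectionError is retried.
    """
    last_exc = None
    for _ in range(max_retries):
        try:
            return reset_fn()
        except ConnectionError as exc:
            last_exc = exc  # remember the failure and try again
    # Chain the last failure so the root cause stays visible in the traceback.
    raise RuntimeError(
        "reset has exceeded max retries({}), and the latest exception is: {!r}".format(
            max_retries, last_exc
        )
    ) from last_exc
```

When reading such a traceback, the chained exception (here the connection timeout) is the one to investigate; the outer RuntimeError only reports that the retries ran out.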
I ran the code you provided on my local machine (I changed a small piece of it, but that does not affect the conclusion). On my side subprocess runs fine and the PIDs are different, as shown in the screenshot below:
[Screenshot: Screen Shot 2022-10-13 at 12 48 39 PM]
My environment is:
import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.8.0 3.8.5 (default, Sep 4 2020, 02:22:02)
[Clang 10.0.0 ] darwin
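So this looks like it may be a Windows-related problem. Is this the only system environment you have?
CUDA on my Ubuntu 22.04 machine has some problems, so I ran subprocess in CPU mode there. I still get the same PID, and the error is reported as well.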
import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.12.1+cu116 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] linux
I also ran it on a CentOS server, and it works fine there too... The screenshot of the run is below:
[Screenshot: Screen Shot 2022-10-19 at 6 07 41 PM]
>>> import ding, torch, sys
>>> print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.4.3 1.10.0+cu102 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] linux
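Could the code being run differ somewhere? I have put the version of your repository code that I actually ran here; you can try running it directly with python3 -u option_masac_config.py
I created a new project with your code and ran it, and it works fine on Windows as well.
However, I compared the code and could not find any difference. I will keep looking for the cause myself. Thank you for your answers.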