PARL
xparl: error when running the test code
I used exactly the code from the xparl tutorial, the version with the decorator added, to test the speed difference. `xparl start --port 6006` reported success. Then I ran:

```python
import threading
import parl

# add this line
@parl.remote_class
class A(object):
    def run(self):
        ans = 0
        for i in range(100000000):
            ans += i

threads = []
# add this line
parl.connect("localhost:6006")
for _ in range(5):
    a = A()
    th = threading.Thread(target=a.run)
    th.start()
    threads.append(th)
for th in threads:
    th.join()
```

The log reports the following error:

```
W0316 15:12:49.668996 19068 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0316 15:12:49.669054 19068 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0316 15:12:49.669065 19068 init.cc:231] The detail failure signal is:
W0316 15:12:49.669078 19068 init.cc:234] *** Aborted at 1615878769 (unix time) try "date -d @1615878769" if you are using GNU date ***
W0316 15:12:49.671977 19068 init.cc:234] PC: @                0x0 (unknown)
W0316 15:12:49.672797 19068 init.cc:234] *** SIGTERM (@0x3e900004810) received by PID 19068 (TID 0x7fab9348d740) from PID 18448; stack trace: ***
W0316 15:12:49.675886 19068 init.cc:234]     @     0x7fab930a6980 (unknown)
W0316 15:12:49.678494 19068 init.cc:234]     @     0x7fab9240acb9 __poll
W0316 15:12:49.678943 19068 init.cc:234]     @     0x7fab6f156945 (unknown)
W0316 15:12:49.679361 19068 init.cc:234]     @     0x7fab6f12cc21 (unknown)
W0316 15:12:49.679793 19068 init.cc:234]     @     0x7fab6f15c1b2 (unknown)
W0316 15:12:49.680192 19068 init.cc:234]     @     0x7fab6f15bcc0 (unknown)
W0316 15:12:49.680472 19068 init.cc:234]     @     0x7fab6f183826 (unknown)
W0316 15:12:49.680754 19068 init.cc:234]     @     0x7fab6f18401f zmq_msg_recv
W0316 15:12:49.680963 19068 init.cc:234]     @     0x7fab6e586a7d (unknown)
W0316 15:12:49.681190 19068 init.cc:234]     @     0x7fab6e57d50a (unknown)
W0316 15:12:49.681409 19068 init.cc:234]     @     0x7fab6ebcb934 (unknown)
W0316 15:12:49.682670 19068 init.cc:234]     @     0x556d535c3963 _PyObject_FastCallKeywords
W0316 15:12:49.682814 19068 init.cc:234]     @     0x556d535c441e call_function
W0316 15:12:49.683812 19068 init.cc:234]     @     0x556d53622776 _PyEval_EvalFrameDefault
W0316 15:12:49.683913 19068 init.cc:234]     @     0x556d53568c92 _PyEval_EvalCodeWithName
W0316 15:12:49.684077 19068 init.cc:234]     @     0x556d535c2648 fast_function
W0316 15:12:49.684201 19068 init.cc:234]     @     0x556d535c430a call_function
W0316 15:12:49.685236 19068 init.cc:234]     @     0x556d53621894 _PyEval_EvalFrameDefault
W0316 15:12:49.686053 19068 init.cc:234]     @     0x556d5356a26a _PyFunction_FastCallDict
W0316 15:12:49.686204 19068 init.cc:234]     @     0x556d535d44d3 method_call
W0316 15:12:49.686354 19068 init.cc:234]     @     0x556d535f1bd8 slot_tp_init
W0316 15:12:49.686424 19068 init.cc:234]     @     0x556d535798a7 type_call
W0316 15:12:49.687206 19068 init.cc:234]     @     0x556d535c3963 _PyObject_FastCallKeywords
W0316 15:12:49.687331 19068 init.cc:234]     @     0x556d535c441e call_function
W0316 15:12:49.688135 19068 init.cc:234]     @     0x556d53621894 _PyEval_EvalFrameDefault
W0316 15:12:49.688207 19068 init.cc:234]     @     0x556d53568c92 _PyEval_EvalCodeWithName
W0316 15:12:49.689034 19068 init.cc:234]     @     0x556d5356a079 PyEval_EvalCodeEx
W0316 15:12:49.689846 19068 init.cc:234]     @     0x556d53639feb PyEval_EvalCode
W0316 15:12:49.690042 19068 init.cc:234]     @     0x556d536a22b3 run_mod
W0316 15:12:49.690891 19068 init.cc:234]     @     0x556d536a2897 PyRun_FileExFlags
W0316 15:12:49.691754 19068 init.cc:234]     @     0x556d536a2a6c PyRun_SimpleFileExFlags
W0316 15:12:49.692699 19068 init.cc:234]     @     0x556d536a7c47 Py_Main
```

The terminal shows:

```
[03-16 15:11:04 Thread-9 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 2
[03-16 15:11:57 Thread-11 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 2
[03-16 15:12:52 Thread-13 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 2
[03-16 15:13:45 Thread-15 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 1
```
parl version: 1.4.3; paddle version: 1.0.2; paddlepaddle: 1.8.5
Hi, could you tell us your operating system and Python version? Also, what does "paddle version: 1.0.2" refer to?
"W0316 15:12:49.668996 19068 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly" 这个是paddle抛出的警告提示(warnings),应该不影响PARL的运行。
Thanks for the quick reply:
python: 3.7.7 and 3.8.8 (tried both)
ubuntu: 18.04
cuda: 11.2
cudnn: 8
I am using the CPU version of paddle. As for "paddle 1.0.2": while verifying with the parl package I kept getting a message that the paddle package was missing, so I additionally ran `pip install paddle`; `pip list` then showed paddle version 1.0.2.
Hi, we tested on ubuntu + python 3.7.7 (conda) + parl==1.4.3 and the example runs normally. After installing paddlepaddle==1.8.5 we do see the C++ warning from paddlepaddle, but it does not affect the code's logic.
We suggest creating a fresh Python environment with conda and testing there. You can also print a message when the code finishes, for example:
- Start xparl:

```shell
xparl start --port 6006 --cpu_num 5
```

- Run the following code:
```python
import threading
import parl

# add this line
@parl.remote_class
class A(object):
    def run(self):
        ans = 0
        for i in range(100):
            ans += i

threads = []
# add this line
parl.connect("localhost:6006")
for _ in range(5):
    a = A()
    th = threading.Thread(target=a.run)
    th.start()
    threads.append(th)
for th in threads:
    th.join()
print("finished")
```
Do you mean it does not affect the final output? I printed the accumulated result at the end and it is wrong; it seems no computation happened at all, and it took even longer, which clearly defeats the purpose of the parallelism test. The elapsed time was 466.127126455307, longer than the earlier approaches. The terminal has ERR output:

```
[03-17 11:15:53 MainThread @client.py:434] Remote actors log url: http://103.53.211.109:50885/logs?client_id=103.53.211.109_43769_1615950953
[03-17 11:18:55 Thread-9 @client.py:294] ERR [xparl] lost connection with a job, current actor num: 2
[03-17 11:20:33 Thread-11 @client.py:294] ERR [xparl] lost connection with a job, current actor num: 2
[03-17 11:22:05 Thread-13 @client.py:294] ERR [xparl] lost connection with a job, current actor num: 2
Exception ignored in: <function RemoteWrapper.__del__ at 0x7f45de136f70>
Traceback (most recent call last):
  File "/home/fyt/miniconda3/envs/paddle/lib/python3.8/site-packages/parl/remote/remote_wrapper.py", line 138, in __del__
AttributeError: 'NoneType' object has no attribute 'ZMQError'
```
Could you create a fresh Python environment with conda, test there, and provide the complete log?
I created a new test environment with conda: python 3.7.7, parl 1.4.3, paddlepaddle not installed. The test still shows the problem described above. The code:

```python
import time

class A(object):
    def run(self):
        ans = 0
        for i in range(100000000):
            ans += i
        return ans

start = time.time()
a = A()
s = 0
for _ in range(5):
    s += a.run()
end = time.time()
print(end-start, s)

print("==========================test2===========================")
import threading
start = time.time()
s = 0
print(s)

class B(object):
    def run(self):
        global s
        for i in range(100000000):
            s += i
        return s

threads = []
for _ in range(5):
    a = B()
    th = threading.Thread(target=a.run)
    th.start()
    threads.append(th)
for th in threads:
    th.join()
end = time.time()
print(end-start, s)

print("==========================test3===========================")
start = time.time()
import threading
import parl
s = 0
print(s)

# add this line
@parl.remote_class
class C(object):
    def run(self):
        global s
        for i in range(100000000):
            s += i
        return s

threads = []
# add this line
parl.connect("localhost:6006")
for _ in range(5):
    a = C()
    th = threading.Thread(target=a.run)
    th.start()
    threads.append(th)
for th in threads:
    th.join()
end = time.time()
print(end-start, s)
```

Terminal output:

```
25.49510097503662 24999999750000000
==========================test2===========================
0
53.94969964027405 7655029083830475
==========================test3===========================
[03-17 15:29:57 MainThread @logger.py:242] Argv: test.py
[03-17 15:29:57 MainThread @__init__.py:38] WRN No deep learning framework was found, but it's ok for parallel computation.
0
[03-17 15:29:57 MainThread @client.py:435] Remote actors log url: http://103.53.211.109:44553/logs?client_id=103.53.211.109_40061_1615966197
[03-17 15:32:39 Thread-9 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 2
[03-17 15:33:59 Thread-11 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 2
[03-17 15:35:15 Thread-13 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 2
[03-17 15:36:23 Thread-15 @client.py:296] ERR [xparl] lost connection with a job, current actor num: 1
391.2815408706665 0
```
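As an aside on the test2 result above: the total 7655029083830475 (instead of the correct 24999999750000000) comes from the unsynchronized `s += i` across threads, which is a read-modify-write race. A minimal sketch of the correctness fix with `threading.Lock` follows (smaller loop bounds for brevity; note the GIL still serializes the CPU-bound work, so locking fixes the result, not the speed):

```python
import threading

s = 0
lock = threading.Lock()

def run(n):
    # "s += i" is a read-modify-write, not atomic across threads;
    # the lock makes each update indivisible so no increment is lost
    global s
    for i in range(n):
        with lock:
            s += i

threads = [threading.Thread(target=run, args=(1000,)) for _ in range(5)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(s)  # 5 * sum(range(1000)) = 2497500
```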
Before running test3, did you start xparl with 5 CPUs, as follows?

```shell
xparl start --port 6006 --cpu_num 5
```
Yes.
OK. Could you reformat the pasted code with proper indentation (e.g. put the code between two ``` fences)? We will test it on our side.
```python
import time

class A(object):
    def run(self):
        ans = 0
        for i in range(100000000):
            ans += i
        return ans

start = time.time()
a = A()
s = 0
for _ in range(5):
    s += a.run()
end = time.time()
print(end-start, s)

print("==========================test2===========================")
import threading
start = time.time()
s = 0
print(s)

class B(object):
    def run(self):
        global s
        for i in range(100000000):
            s += i
        return s

threads = []
for _ in range(5):
    a = B()
    th = threading.Thread(target=a.run)
    th.start()
    threads.append(th)
for th in threads:
    th.join()
end = time.time()
print(end-start, s)

print("==========================test3===========================")
start = time.time()
import threading
import parl
s = 0
print(s)

# add this line
@parl.remote_class
class C(object):
    def run(self):
        global s
        for i in range(100000000):
            s += i
        return s

threads = []
# add this line
parl.connect("localhost:6006")
for _ in range(5):
    a = C()
    th = threading.Thread(target=a.run)
    th.start()
    threads.append(th)
for th in threads:
    th.join()
end = time.time()
print(end-start, s)
```
Hi, we tested this and can reproduce your output. The reason test3 (PARL parallel) takes so long is that PARL ships the whole code file containing the @parl.remote_class-decorated class to the remote side (other processes on this machine, or other machines) for execution; in this example, instantiating the decorated class re-runs test1 and test2 on the remote side, which makes the wall time longer. So you can move test3 to the front of the script, or put it in a separate file.
Also, test3 printing s as 0 has a similar cause: the @parl.remote_class-decorated class runs on the remote side, so the local s is never incremented and stays 0.
Thank you very much for the answer; the previous problem is solved. With the approach above the speedup was modest, so I switched to multiprocessing.Process, which gave a much clearer speedup. Thanks again for the prompt replies!
One more question now: does paddle's optimizer module lack second-order optimizers, such as Newton's method, quasi-Newton methods, or BFGS? If I implement one myself, is there a relevant tutorial? I am not sure where to start and would appreciate some guidance.
I understand the algorithms themselves; the main issue is that I do not know where to start when fitting them onto this framework.