PaddleRec
[User question] SR-GNN training and inference speed fall short of expectations
Traceback (most recent call last):
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/envs.py", line 221, in lazy_instance_by_fliename
globals(), locals(), package.split("."))
File "models/recall/gnn/model.py", line 23, in <module>
from paddlerec.core.metrics import RecallK
ImportError: cannot import name 'RecallK' from 'paddlerec.core.metrics' (/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/metrics/__init__.py)
Catch Exception:cannot import name 'RecallK' from 'paddlerec.core.metrics' (/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/metrics/__init__.py)
Traceback (most recent call last):
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 246, in run
self.context_process(self._context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 207, in context_process
self._status_processor[context['status']](context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/general_trainer.py", line 90, in network
network_class.build_network(context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/network.py", line 64, in build_network
model_path, "Model")(context["env"])
TypeError: 'NoneType' object is not callable
Catch Exception:'NoneType' object is not callable
--------------------------------
PaddleRec Error Message Summary:
--------------------------------
Exit PaddleRec. catch exception in precoss status: [network_pass], except: 'NoneType' object is not callable
TypeError
PaddleRec: Runner single_cpu_train Begin
Executor Mode: train
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Warning:please make sure there are no hidden files in the dataset folder and check these hidden files:[]
need_split_files: False
QueueDataset can not support PY3, change to DataLoader
Traceback (most recent call last):
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 256, in run
self.context_process(self._context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 217, in context_process
self._status_processor[context['status']](context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/general_trainer.py", line 90, in network
network_class.build_network(context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/network.py", line 80, in build_network
model._data_loader)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/dataset.py", line 60, in get_dataloader
reader_class_name=reader_class_name)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/dataloader_instance.py", line 96, in dataloader_by_name
return gen_batch_reader()
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/dataloader_instance.py", line 93, in gen_batch_reader
return reader.generate_batch_from_trainfiles(files)
File "models/recall/gnn/reader.py", line 135, in generate_batch_from_trainfiles
self.input = self.base_read(files)
File "models/recall/gnn/reader.py", line 35, in base_read
for line in fin:
File "/home/xulm1/anaconda3/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Catch Exception:'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
--------------------------------
PaddleRec Error Message Summary:
--------------------------------
Exit PaddleRec. catch exception in precoss status: [network_pass], except: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
UnicodeDecodeError
I ran the command below. The second traceback is the result after reinstalling.
$ python -m paddlerec.run -m models/recall/gnn/config.yaml
What I don't understand is why RecallCnt keeps growing. There aren't that many items in total.
2020-09-15 15:14:05,122-INFO: [Train], epoch: 0, batch: 1, time_each_interval: 29.89s, LOSS: [10.532445], InsCnt: [10000.], RecallCnt: [73.], Acc(Recall@20): [0.0073]
2020-09-15 15:14:18,110-INFO: [Train], epoch: 0, batch: 2, time_each_interval: 12.99s, LOSS: [10.150826], InsCnt: [15000.], RecallCnt: [266.], Acc(Recall@20): [0.01773333]
2020-09-15 15:14:30,812-INFO: [Train], epoch: 0, batch: 3, time_each_interval: 12.70s, LOSS: [9.429095], InsCnt: [20000.], RecallCnt: [459.], Acc(Recall@20): [0.02295]
2020-09-15 15:14:42,839-INFO: [Train], epoch: 0, batch: 4, time_each_interval: 12.03s, LOSS: [8.945746], InsCnt: [25000.], RecallCnt: [814.], Acc(Recall@20): [0.03256]
2020-09-15 15:14:54,804-INFO: [Train], epoch: 0, batch: 5, time_each_interval: 11.96s, LOSS: [8.617248], InsCnt: [30000.], RecallCnt: [1152.], Acc(Recall@20): [0.0384]
2020-09-15 15:15:06,927-INFO: [Train], epoch: 0, batch: 6, time_each_interval: 12.12s, LOSS: [8.601961], InsCnt: [35000.], RecallCnt: [1509.], Acc(Recall@20): [0.04311429]
2020-09-15 15:15:18,632-INFO: [Train], epoch: 0, batch: 7, time_each_interval: 11.70s, LOSS: [8.352413], InsCnt: [40000.], RecallCnt: [1921.], Acc(Recall@20): [0.048025]
2020-09-15 15:15:30,354-INFO: [Train], epoch: 0, batch: 8, time_each_interval: 11.72s, LOSS: [8.464729], InsCnt: [45000.], RecallCnt: [2270.], Acc(Recall@20): [0.05044444]
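A note on the numbers in this log: on every line, Acc(Recall@20) equals RecallCnt divided by InsCnt. That is consistent with both being running totals over the epoch (hits so far and instances seen so far) rather than counts of distinct items. The check below uses only values copied from the log; the cumulative reading is an inference from those values, not something confirmed against the PaddleRec source.

```python
# (InsCnt, RecallCnt, logged Acc(Recall@20)) copied from the training log above.
rows = [
    (10000, 73, 0.0073),
    (15000, 266, 0.01773333),
    (20000, 459, 0.02295),
    (25000, 814, 0.03256),
    (30000, 1152, 0.0384),
    (35000, 1509, 0.04311429),
    (40000, 1921, 0.048025),
    (45000, 2270, 0.05044444),
]


def matches(ins_cnt, recall_cnt, logged_acc, tol=1e-6):
    """True when the logged accuracy equals RecallCnt / InsCnt within tol."""
    return abs(recall_cnt / ins_cnt - logged_acc) < tol


# Every logged line satisfies Acc = RecallCnt / InsCnt.
all_match = all(matches(*row) for row in rows)
print(all_match)
```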
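With 1 million rows of training data and a little over 30,000 items, one batch takes about 12 s at batch_size=5000, so one epoch takes 1,000,000 / 5000 * 12 s = 2400 s (40 min). The TF version needs less than 10 min on the same amount of data, so this needs to improve.
And that is before using the 10-million-row training set. What am I supposed to do? It is still not usable for large data.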
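The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 12 s per batch is read off the log, and the projection to 10 million rows assumes the per-batch time stays the same, which has not been measured.

```python
def epoch_seconds(num_rows, batch_size, seconds_per_batch):
    """Estimated wall time of one epoch, rounding the batch count up."""
    num_batches = -(-num_rows // batch_size)  # ceiling division
    return num_batches * seconds_per_batch


# Figures from the post: 1M rows, batch_size=5000, about 12 s per batch.
one_million = epoch_seconds(1_000_000, 5000, 12)

# Assumption: the same per-batch time holds for the 10M-row data set.
ten_million = epoch_seconds(10_000_000, 5000, 12)

print(one_million / 60, "min per epoch for 1M rows")
print(ten_million / 3600, "h per epoch for 10M rows")
```

Against the reported TF time of under 10 min (600 s), 2400 s per epoch is at least four times slower.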
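How is inference done? How do I get the list of recommended items for each user? Does the data have to be written to disk (train and test) and then read back? That is cumbersome. Can't training start right after data processing, as one pipeline?
models/recall/gnn/data/config.txt contains two numbers, 187993 7806633. How do I get these two numbers into config.yaml with a script? This is a real hassle: I retrain on a schedule, and I can't check the file every so often and edit it by hand.
Also, what if I need to change other values in config.yaml? This format is cumbersome. I think passing parameters through argparse would be better.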
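One possible workaround for scheduled retraining is a small script that copies the numbers from config.txt into config.yaml before each run. The sketch below is not part of PaddleRec. The flags `--counts`, `--yaml` and `--keys` are made up for illustration, and the YAML key names must be supplied by the caller, because this post does not say which keys the two numbers belong to. It edits the file as text using only the standard library, replaces the first line that starts with each key, and drops any trailing comment on that line.

```python
import argparse
import re


def read_counts(path):
    """Return the whitespace-separated integers stored in config.txt."""
    with open(path, encoding="utf-8") as fin:
        return [int(token) for token in fin.read().split()]


def set_yaml_scalar(text, key, value):
    """Replace the value on the first `key: ...` line, keeping its indentation."""
    pattern = re.compile(r"^([ \t]*" + re.escape(key) + r":[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), text, count=1)
    if count == 0:
        raise KeyError(key)
    return new_text


def main(argv=None):
    parser = argparse.ArgumentParser(description="Copy config.txt numbers into config.yaml")
    # Default paths are the ones mentioned in the post.
    parser.add_argument("--counts", default="models/recall/gnn/data/config.txt")
    parser.add_argument("--yaml", default="models/recall/gnn/config.yaml")
    # Key names are caller-supplied, in the same order as the numbers in config.txt.
    parser.add_argument("--keys", nargs="+", required=True)
    args = parser.parse_args(argv)

    counts = read_counts(args.counts)
    with open(args.yaml, encoding="utf-8") as fin:
        text = fin.read()
    for key, value in zip(args.keys, counts):
        text = set_yaml_scalar(text, key, value)
    with open(args.yaml, "w", encoding="utf-8") as fout:
        fout.write(text)

# When saving this as a script, call main() from an
# `if __name__ == "__main__":` guard.
```

Saved under a hypothetical name such as patch_config.py, it could run from the same cron job as training, immediately before `python -m paddlerec.run -m models/recall/gnn/config.yaml`.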
/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 256, in run
self.context_process(self._context)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 217, in context_process
self._status_processor[context['status']](context)
File "core/trainers/general_trainer.py", line 113, in startup
startup_class.startup(context)
File "/data1/xulm1/PaddleRec/core/trainers/framework/startup.py", line 237, in startup
context["exe"].run(startup_prog)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 790, in run
six.reraise(*sys.exc_info())
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/six.py", line 696, in reraise
raise value
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 785, in run
use_program_cache=use_program_cache)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 838, in _run_impl
use_program_cache=use_program_cache)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 912, in _run_program
fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2 paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
3 paddle::framework::GarbageCollector::GarbageCollector(paddle::platform::Place const&, unsigned long)
4 paddle::framework::UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector(paddle::platform::CUDAPlace const&, unsigned long)
5 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
6 paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)
----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(2) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)
But GPU 2 can in fact be used:
>>> import paddle.fluid as fluid
>>> fluid.CUDAPlace(2)
<paddle.fluid.core_avx.CUDAPlace object at 0x7fcf4e938c30>
>>>
With both train and infer on GPU 1 (the gpu explicitly set to 1):
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 1. Cannot allocate 7.003248GB memory on GPU 1, available memory is only 2.751526GB.
Please check whether there is any other process using GPU 1.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)
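This suggests that the memory used during training is not released after training finishes. Next I tried train on GPU 1 and infer on GPU 0.
It still doesn't work, and I no longer know what I changed that I shouldn't have. This is exhausting.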
----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(0) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)
EnforceNotMet
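Regarding the out-of-memory error above: if memory held by the training phase really is the cause, one workaround is to run training and inference as two separate processes, since a process's GPU memory is returned to the driver when it exits. The sketch below is only an illustration. `run_phase` and the two config file names in the commented usage are hypothetical, and splitting config.yaml into a train-only and an infer-only copy is an assumption about how the phases could be separated.

```python
import os
import subprocess
import sys


def run_phase(cmd, gpu_id):
    """Run one phase in its own process, exposing only the given GPU to it."""
    # With CUDA_VISIBLE_DEVICES set to a single id, the child process sees
    # that physical GPU as device 0.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.run(cmd, env=env, check=True)


# Hypothetical usage: train_config.yaml and infer_config.yaml are assumed to be
# copies of config.yaml that each run only one phase.
# run_phase([sys.executable, "-m", "paddlerec.run", "-m", "train_config.yaml"], gpu_id=1)
# run_phase([sys.executable, "-m", "paddlerec.run", "-m", "infer_config.yaml"], gpu_id=1)
```

Because the child sees the selected GPU as device 0, any device id inside the config would need to be 0 under this setup.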
This is still some way from being usable in practice.