PaddleRec [User question] SR-GNN training and inference speed below expectations

Sep 15 '20 03:09 MrChengmo

Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/envs.py", line 221, in lazy_instance_by_fliename
    globals(), locals(), package.split("."))
  File "models/recall/gnn/model.py", line 23, in <module>
    from paddlerec.core.metrics import RecallK
ImportError: cannot import name 'RecallK' from 'paddlerec.core.metrics' (/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/metrics/__init__.py)
Catch Exception:cannot import name 'RecallK' from 'paddlerec.core.metrics' (/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/metrics/__init__.py)
Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 246, in run
    self.context_process(self._context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 207, in context_process
    self._status_processor[context['status']](context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/general_trainer.py", line 90, in network
    network_class.build_network(context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/network.py", line 64, in build_network
    model_path, "Model")(context["env"])
TypeError: 'NoneType' object is not callable
Catch Exception:'NoneType' object is not callable

--------------------------------
PaddleRec Error Message Summary:
--------------------------------

Exit PaddleRec. catch exception in precoss status: [network_pass], except: 'NoneType' object is not callable
TypeError
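
The first traceback mixes two code locations: models/recall/gnn/model.py is loaded from the working directory, while paddlerec.core.metrics comes from the installed paddle_rec-0.1.0 egg. One way to see whether the installed package provides a name that the model file imports is sketched below. The helper name module_exports is made up for this note; it uses only importlib from the standard library.

```python
import importlib


def module_exports(module_name, attr):
    """Return True if `module_name` can be imported and defines `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package (or one of its own dependencies) is not importable here.
        return False
    return hasattr(module, attr)


# The module and name are taken from the ImportError above.
print(module_exports("paddlerec.core.metrics", "RecallK"))
```

If this prints False while the model file still imports RecallK, the installed package and the model files probably come from different versions of PaddleRec.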

Sep 15 '20 06:09 ucasiggcas

PaddleRec: Runner single_cpu_train Begin
Executor Mode: train
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Warning:please make sure there are no hidden files in the dataset folder and check these hidden files:[]
need_split_files: False
QueueDataset can not support PY3, change to DataLoader
Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 256, in run
    self.context_process(self._context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 217, in context_process
    self._status_processor[context['status']](context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/general_trainer.py", line 90, in network
    network_class.build_network(context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/network.py", line 80, in build_network
    model._data_loader)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/dataset.py", line 60, in get_dataloader
    reader_class_name=reader_class_name)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/dataloader_instance.py", line 96, in dataloader_by_name
    return gen_batch_reader()
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/dataloader_instance.py", line 93, in gen_batch_reader
    return reader.generate_batch_from_trainfiles(files)
  File "models/recall/gnn/reader.py", line 135, in generate_batch_from_trainfiles
    self.input = self.base_read(files)
  File "models/recall/gnn/reader.py", line 35, in base_read
    for line in fin:
  File "/home/xulm1/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Catch Exception:'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

--------------------------------
PaddleRec Error Message Summary:
--------------------------------

Exit PaddleRec. catch exception in precoss status: [network_pass], except: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
UnicodeDecodeError
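
The message says that the very first byte of a file read by reader.py is 0x80, which can never start a UTF-8 sequence, so at least one file handed to the reader is not UTF-8 text. One common source of a leading 0x80 is a binary pickle file (pickle protocol 2 and later start with that byte), though the thread does not confirm what the file was. A minimal sketch for listing such files, assuming the dataset folder is scanned recursively; find_non_utf8_files is an illustrative name, not a PaddleRec function:

```python
from pathlib import Path


def find_non_utf8_files(folder):
    """Return (path, first_byte) for every file under `folder` that is not valid UTF-8."""
    bad = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()  # reads the whole file; fine for a one-off check
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((path, data[:1]))
    return bad
```

Calling find_non_utf8_files on the training data folder named in config.yaml would list the files the reader cannot decode, including hidden files, which pathlib's "*" pattern also matches.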

Sep 15 '20 06:09 ucasiggcas

I ran the command below; the second traceback is the result after reinstalling. $ python -m paddlerec.run -m models/recall/gnn/config.yaml

Sep 15 '20 06:09 ucasiggcas

What I don't understand is why RecallCnt keeps growing. There aren't that many items in total.
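
The training log posted in this thread suggests an answer: InsCnt rises by 5000 (the batch size) on every line, and each Acc(Recall@20) value equals RecallCnt / InsCnt. That pattern is what running totals over all samples seen so far would produce, in which case RecallCnt counts correctly recalled samples rather than distinct items. This reading comes from the numbers alone, not from the PaddleRec source. The check below uses only values copied from the log:

```python
# (InsCnt, RecallCnt, Acc(Recall@20)) copied from the training log in this thread.
log_rows = [
    (10000, 73, 0.0073),
    (15000, 266, 0.01773333),
    (20000, 459, 0.02295),
    (25000, 814, 0.03256),
    (30000, 1152, 0.0384),
    (35000, 1509, 0.04311429),
    (40000, 1921, 0.048025),
    (45000, 2270, 0.05044444),
]

for ins_cnt, recall_cnt, logged_acc in log_rows:
    ratio = recall_cnt / ins_cnt
    # Each logged accuracy matches the ratio of the two counters.
    assert abs(ratio - logged_acc) < 1e-6
    print(f"InsCnt={ins_cnt:>6} RecallCnt={recall_cnt:>5} ratio={ratio:.8f}")

# InsCnt grows by the same amount between consecutive log lines.
steps = {later[0] - earlier[0] for earlier, later in zip(log_rows, log_rows[1:])}
print(steps)  # prints {5000}
```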

Sep 15 '20 07:09 ucasiggcas

2020-09-15 15:14:05,122-INFO: 	[Train],  epoch: 0,  batch: 1, time_each_interval: 29.89s, LOSS: [10.532445], InsCnt: [10000.], RecallCnt: [73.], Acc(Recall@20): [0.0073]
2020-09-15 15:14:18,110-INFO: 	[Train],  epoch: 0,  batch: 2, time_each_interval: 12.99s, LOSS: [10.150826], InsCnt: [15000.], RecallCnt: [266.], Acc(Recall@20): [0.01773333]
2020-09-15 15:14:30,812-INFO: 	[Train],  epoch: 0,  batch: 3, time_each_interval: 12.70s, LOSS: [9.429095], InsCnt: [20000.], RecallCnt: [459.], Acc(Recall@20): [0.02295]
2020-09-15 15:14:42,839-INFO: 	[Train],  epoch: 0,  batch: 4, time_each_interval: 12.03s, LOSS: [8.945746], InsCnt: [25000.], RecallCnt: [814.], Acc(Recall@20): [0.03256]
2020-09-15 15:14:54,804-INFO: 	[Train],  epoch: 0,  batch: 5, time_each_interval: 11.96s, LOSS: [8.617248], InsCnt: [30000.], RecallCnt: [1152.], Acc(Recall@20): [0.0384]
2020-09-15 15:15:06,927-INFO: 	[Train],  epoch: 0,  batch: 6, time_each_interval: 12.12s, LOSS: [8.601961], InsCnt: [35000.], RecallCnt: [1509.], Acc(Recall@20): [0.04311429]
2020-09-15 15:15:18,632-INFO: 	[Train],  epoch: 0,  batch: 7, time_each_interval: 11.70s, LOSS: [8.352413], InsCnt: [40000.], RecallCnt: [1921.], Acc(Recall@20): [0.048025]
2020-09-15 15:15:30,354-INFO: 	[Train],  epoch: 0,  batch: 8, time_each_interval: 11.72s, LOSS: [8.464729], InsCnt: [45000.], RecallCnt: [2270.], Acc(Recall@20): [0.05044444]

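With 1 million rows of training data, a little over 30,000 items, and batch_size=5000, one batch takes about 12 s, so one epoch needs 1,000,000 / 5000 × 12 s = 2400 s. The TF version needs less than 10 min for the same amount of data. This needs to be improved.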
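
The estimate above, written out as a small calculation. The 12 s figure is rounded from the time_each_interval values in the log, and the 10 min target is the TF figure quoted in this post:

```python
import math

rows = 1_000_000
batch_size = 5000
seconds_per_batch = 12  # rounded from the steady-state time_each_interval values
target_epoch_seconds = 10 * 60  # "less than 10 min" reported for the TF version

batches_per_epoch = math.ceil(rows / batch_size)
epoch_seconds = batches_per_epoch * seconds_per_batch
# Per-batch time needed to finish an epoch within the target.
target_seconds_per_batch = target_epoch_seconds / batches_per_epoch

print(batches_per_epoch, epoch_seconds, epoch_seconds / 60, target_seconds_per_batch)
# prints: 200 2400 40.0 3.0
```

So matching the TF run would require each batch of 5000 to finish in about 3 s instead of 12 s, a fourfold speedup.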

Sep 15 '20 07:09 ucasiggcas

And this is not even with the 10-million-row training set yet. What can be done? It is still impractical for large data.

Sep 15 '20 07:09 ucasiggcas

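How is inference done? How do I get the list of recommended items for each user? Does the data really have to be saved to disk first (train and test) and then read back? That is cumbersome. Can't training start right after data processing, as one pipeline?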
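
The thread does not show how PaddleRec exposes inference results, so the following is only a generic sketch of the last step implied by the Recall@20 metric: ranking items by score and keeping the top 20. It assumes the infer step can produce a score per item for each session; the dict format and the name top_k_items are hypothetical.

```python
import heapq


def top_k_items(scores, k=20):
    """Return the ids of the k highest-scoring items, best first.

    `scores` maps item id -> score for one session (a hypothetical format).
    """
    return heapq.nlargest(k, scores, key=scores.get)
```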

Sep 15 '20 11:09 ucasiggcas

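models/recall/gnn/data/config.txt contains two numbers: 187993 7806633. How can a script write them into config.yaml? This is a real hassle. I retrain on a schedule, and I can't check the file every so often and edit config.yaml by hand.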
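
One way to automate this with the standard library alone is to read the two integers and rewrite the matching lines of config.yaml as text. This is a sketch under assumptions: the key names first_number and second_number are placeholders (the thread does not say which config.yaml keys the two numbers belong to), and the demo runs on throwaway files rather than the real ones.

```python
import re
import tempfile
from pathlib import Path


def read_two_numbers(path):
    """Read the two whitespace-separated integers stored in config.txt."""
    first, second = Path(path).read_text().split()[:2]
    return int(first), int(second)


def set_yaml_scalar(yaml_text, key, value):
    """Replace the value on the first `key: ...` line, keeping its indentation.

    Everything after the colon on that line, including a trailing comment, is replaced.
    """
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), yaml_text, count=1)
    if count == 0:
        raise KeyError(f"key {key!r} not found")
    return new_text


# Demo on temporary files so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp, "config.txt")
    yaml_file = Path(tmp, "config.yaml")
    txt.write_text("187993 7806633\n")
    # Placeholder keys; use the names your config.yaml actually contains.
    yaml_file.write_text("hyper_parameters:\n  first_number: 0\n  second_number: 0\n")

    first, second = read_two_numbers(txt)
    text = yaml_file.read_text()
    text = set_yaml_scalar(text, "first_number", first)
    text = set_yaml_scalar(text, "second_number", second)
    yaml_file.write_text(text)
    updated_yaml = yaml_file.read_text()

print(updated_yaml)
```

Editing the file as text keeps comments and ordering on all other lines intact, at the cost of handling only simple `key: value` lines.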

Sep 15 '20 13:09 ucasiggcas

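Also, what should I do when I need to change values in config.yaml? This format is cumbersome. I think it would be better to pass parameters directly through argparse.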
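
Until something like that exists, a thin wrapper can approximate it: accept key=value pairs on the command line, write a patched copy of config.yaml, and print the run command quoted earlier in this thread with the patched file. Everything below is a hypothetical helper, not part of PaddleRec; it only handles flat `key: value` lines, and the default output file name is made up.

```python
import argparse
import re
import sys


def parse_overrides(pairs):
    """Turn ["key=value", ...] into a dict, rejecting malformed entries."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        overrides[key] = value
    return overrides


def apply_overrides(yaml_text, overrides):
    """Rewrite the first `key: ...` line for each override."""
    for key, value in overrides.items():
        pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE)
        yaml_text, count = pattern.subn(lambda m, v=value: m.group(1) + v, yaml_text, count=1)
        if count == 0:
            raise KeyError(f"key {key!r} not found")
    return yaml_text


def build_parser():
    parser = argparse.ArgumentParser(description="Patch a copy of config.yaml.")
    parser.add_argument("--config", default="models/recall/gnn/config.yaml")
    parser.add_argument("--out", default="config.patched.yaml")
    parser.add_argument("--set", dest="overrides", action="append", default=[],
                        metavar="KEY=VALUE")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.config, encoding="utf-8") as fin:
        text = fin.read()
    patched = apply_overrides(text, parse_overrides(args.overrides))
    with open(args.out, "w", encoding="utf-8") as fout:
        fout.write(patched)
    # Same command as quoted in this thread, pointed at the patched copy.
    print(" ".join([sys.executable, "-m", "paddlerec.run", "-m", args.out]))
```

Saved as a script with a call to main() at the bottom, it could be invoked with arguments such as `--set batch_size=1000`, provided the config file contains a line for that key.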

Sep 15 '20 13:09 ucasiggcas

/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 256, in run
    self.context_process(self._context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 217, in context_process
    self._status_processor[context['status']](context)
  File "core/trainers/general_trainer.py", line 113, in startup
    startup_class.startup(context)
  File "/data1/xulm1/PaddleRec/core/trainers/framework/startup.py", line 237, in startup
    context["exe"].run(startup_prog)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 790, in run
    six.reraise(*sys.exc_info())
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 785, in run
    use_program_cache=use_program_cache)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 838, in _run_impl
    use_program_cache=use_program_cache)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 912, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2   paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
3   paddle::framework::GarbageCollector::GarbageCollector(paddle::platform::Place const&, unsigned long)
4   paddle::framework::UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector(paddle::platform::CUDAPlace const&, unsigned long)
5   paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
6   paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)

----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(2) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)

Sep 16 '20 10:09 ucasiggcas

But in practice GPU 2 can be used:

>>> import paddle.fluid as fluid
>>> fluid.CUDAPlace(2)
<paddle.fluid.core_avx.CUDAPlace object at 0x7fcf4e938c30>
>>>
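
The C++ call stack shows the failure inside DeviceContextPool::Get during Executor::Run, which is a later step than constructing a CUDAPlace object, so the interactive check above does not exercise the code path that failed. The error text itself asks whether the training process holds the correct gpu_id. A first diagnostic, assuming GPU visibility is controlled through environment variables, is to print them from inside the training process. CUDA_VISIBLE_DEVICES is the standard CUDA variable; FLAGS_selected_gpus is a Paddle flag, and whether PaddleRec sets it for this run is not shown in the thread. The function names are illustrative.

```python
import os


def gpu_selection_env(environ=None):
    """Collect the environment variables that commonly restrict GPU visibility."""
    environ = os.environ if environ is None else environ
    names = ("CUDA_VISIBLE_DEVICES", "FLAGS_selected_gpus")
    return {name: environ.get(name) for name in names}


def visible_device_count(cuda_visible_devices):
    """Number of devices listed in a CUDA_VISIBLE_DEVICES value, or None if unset."""
    if cuda_visible_devices is None:
        return None  # no restriction from this variable
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


env = gpu_selection_env()
print(env)
print(visible_device_count(env["CUDA_VISIBLE_DEVICES"]))
```

When CUDA_VISIBLE_DEVICES lists n devices, the process addresses them as 0 to n-1 regardless of their physical ids, so a device index that works in one shell may not exist in a process started with a different environment.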

Sep 16 '20 10:09 ucasiggcas

Using GPU 1 for both train and infer, with the GPU explicitly set to 1:

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 1. Cannot allocate 7.003248GB memory on GPU 1, available memory is only 2.751526GB.

Please check whether there is any other process using GPU 1.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 

 at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)

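This indicates that the memory occupied by training is not released after training finishes. Next I will try train on GPU 1 and infer on GPU 0.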
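
The out-of-memory message establishes that about 2.75 GB was free on GPU 1 when about 7 GB was requested; it does not say which process held the rest, and Paddle's own hint is to check for other processes on that GPU. A sketch for inspecting per-GPU memory between the train and infer phases, assuming nvidia-smi is on PATH; the function names are illustrative:

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse `index, used, total` lines (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory usage; requires an NVIDIA driver."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(output)
```

Calling query_gpu_memory() once before training, once after training ends, and once before inference would show whether the memory on GPU 1 is held by the same run or by something else.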

Sep 16 '20 11:09 ucasiggcas

It still does not work. I don't know what I changed that I shouldn't have. This is exhausting.

----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(0) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)

EnforceNotMet

This is still some distance from practical use.

Sep 16 '20 12:09 ucasiggcas

Oct 21 '20 09:10 ucasiggcas
