
DIN model: problem encountered when running run_din

Open DoubleYing opened this issue 4 years ago • 5 comments

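Describe the question: I ran run_din.py directly with the number of epochs set to 10. After about six epochs it raised the error below. It seems to be "IteratorResource does not exist", but I don't know why this happens. Could someone point me in the right direction?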

`Train on 1 samples, validate on 2 samples
Epoch 1/10
C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\framework\indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\framework\indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-11-15 17:44:39.292483: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-11-15 17:44:39.296103: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-11-15 17:44:39.306838: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'model/attention_sequence_pooling_layer/local_activation_unit/concat' has self cycle fanin 'model/attention_sequence_pooling_layer/local_activation_unit/concat'.
2019-11-15 17:44:39.314760: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'model/attention_sequence_pooling_layer/local_activation_unit/concat' has self cycle fanin 'model/attention_sequence_pooling_layer/local_activation_unit/concat'.
2019-11-15 17:44:39.316846: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] arithmetic_optimizer failed: Invalid argument: The graph couldn't be sorted in topological order.
2019-11-15 17:44:39.321719: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-11-15 17:44:39.325654: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-11-15 17:44:39.330328: W tensorflow/core/common_runtime/process_function_library_runtime.cc:675] Ignoring multi-device function optimization failure: Invalid argument: The graph couldn't be sorted in topological order.

1/1 [==============================] - 5s 5s/sample - loss: 0.7042 - binary_crossentropy: 0.7042 - val_loss: 0.6975 - val_binary_crossentropy: 0.6975
Epoch 2/10
1/1 [==============================] - 0s 33ms/sample - loss: 0.6956 - binary_crossentropy: 0.6956 - val_loss: 0.6961 - val_binary_crossentropy: 0.6961
Epoch 3/10
1/1 [==============================] - 0s 35ms/sample - loss: 0.6892 - binary_crossentropy: 0.6892 - val_loss: 0.6948 - val_binary_crossentropy: 0.6948
Epoch 4/10
1/1 [==============================] - 0s 43ms/sample - loss: 0.6836 - binary_crossentropy: 0.6836 - val_loss: 0.6938 - val_binary_crossentropy: 0.6938
Epoch 5/10
1/1 [==============================] - 0s 58ms/sample - loss: 0.6779 - binary_crossentropy: 0.6779 - val_loss: 0.6928 - val_binary_crossentropy: 0.6928
Epoch 6/10
1/1 [==============================] - 0s 61ms/sample - loss: 0.6730 - binary_crossentropy: 0.6730 - val_loss: 0.6920 - val_binary_crossentropy: 0.6920
Epoch 7/10
2019-11-15 17:44:39.675306: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at iterator_ops.cc:893 : Not found: Resource AnonymousIterator/AnonymousIterator7/class tensorflow::data::IteratorResource does not exist.
2019-11-15 17:44:39.676532: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Not found: Resource AnonymousIterator/AnonymousIterator7/class tensorflow::data::IteratorResource does not exist.
	 [[{{node IteratorGetNext}}]]

1/1 [==============================] - 0s 68ms/sample - loss: 0.6679 - binary_crossentropy: 0.6679
Traceback (most recent call last):
  File "C:/Workspace/python/recommend_system/DeepCTR/examples/run_din.py", line 39, in <module>
    history = model.fit(x, y, verbose=1, epochs=10, validation_split=0.5)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 370, in fit
    total_epochs=1)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 494, in _call
    results = self._stateful_fn(*args, **kwds)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "C:\Software\Anaconda3\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: Resource AnonymousIterator/AnonymousIterator7/class tensorflow::data::IteratorResource does not exist.
	 [[node IteratorGetNext (defined at \Software\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_distributed_function_4669]

Function call stack: distributed_function

HTTPSConnectionPool(host='pypi.python.org', port=443): Max retries exceeded with url: /pypi/deepctr/json (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001E74A08F940>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。',))

Process finished with exit code 1 `

Operating environment:

  • python version [3.6]
  • tensorflow version [2.0.0]
  • deepctr version [0.6.3]

DoubleYing avatar Nov 15 '19 12:11 DoubleYing

Hmm, this is strange. The error above is what I got running on Windows, but the same code runs without problems on a Linux server. What actually causes the error, though?

DoubleYing avatar Nov 15 '19 13:11 DoubleYing

I ran into the same problem on macOS.

Alwaysproblem avatar Dec 30 '19 03:12 Alwaysproblem

Occasionally it does run through, though.

Alwaysproblem avatar Dec 30 '19 03:12 Alwaysproblem

I updated to tensorflow-2.1.0-rc0. It now runs and produces results, but this error still shows up.

Alwaysproblem avatar Dec 30 '19 10:12 Alwaysproblem

I just hit this problem running on 2.0. After updating to the stable TF 2.1 release, the error no longer appears.

TanTingyi avatar May 29 '20 05:05 TanTingyi
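
For anyone landing here later, the reports above can be summarized as a small version check. This is only a sketch of what was observed in this thread, not an official compatibility rule: 2.0.0 raised the error, 2.1.0-rc0 ran but still printed it, and the stable 2.1 release did not. The helper names `parse_version` and `reported_fixed` are hypothetical, and treating releases newer than 2.1 as fixed is an assumption that nobody in this thread has confirmed.

```python
import re


def parse_version(version):
    """Split a version string such as '2.1.0-rc0' into
    (major, minor, patch, is_prerelease)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(.*)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor, patch, suffix = match.groups()
    # Assumption: only 'rc' and 'dev' suffixes are treated as pre-releases.
    is_prerelease = "rc" in suffix or "dev" in suffix
    return int(major), int(minor), int(patch or 0), is_prerelease


def reported_fixed(tf_version):
    """Return True if, going by the reports in this thread, the
    IteratorResource error is not expected on this TensorFlow version.

    Assumption: versions after the stable 2.1 release are treated as fixed,
    although only 2.1 itself was reported here.
    """
    major, minor, patch, is_prerelease = parse_version(tf_version)
    return (major, minor, patch) >= (2, 1, 0) and not is_prerelease


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed")
    else:
        if not reported_fixed(tf.__version__):
            print("TensorFlow %s: consider upgrading to a stable 2.1+ release"
                  % tf.__version__)
```

Running this before `run_din.py` only flags the versions mentioned above. It does not explain the root cause, which this thread never established.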