mask-rcnn-tf2

train.py raises an error in the first epoch

Open mx12zhu opened this issue 3 years ago • 9 comments

Hello, when I run train.py, the first epoch fails with the following error:

Epoch 00001: LearningRateScheduler reducing learning rate to 3e-06.
Epoch 1/100
2022-07-18 09:42:46.656945: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: mask_rcnn/proposal_targets/roi_assertion/AssertGuard/branch_executed/_8
Traceback (most recent call last):
  File "e:/zmx/mask-rcnn-tf2-master/train.py", line 295, in <module>
    callbacks = callbacks
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\keras\engine\training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\keras\engine\training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\function.py", line 598, in call
    ctx=ctx)
  File "D:\anaconda3\envs\tf2-cpu\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1] = 94081 is not in [0, 65472)
  [[node mask_rcnn/tf_op_layer_GatherV2_2/GatherV2_2 (defined at e:/zmx/mask-rcnn-tf2-master/train.py:295) ]] [Op:__inference_train_function_44647]

Function call stack: train_function

2022-07-18 09:42:51.062100: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

My environment:

scipy==1.4.1
numpy==1.19.2
matplotlib==3.2.1
opencv_python==4.2.0.34
tensorflow_cpu==2.2.0
tqdm==4.46.1
Pillow==8.2.0
h5py==2.10.0

mx12zhu • Jul 18 '22 01:07
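
A note on reading the error: `InvalidArgumentError: indices[1] = 94081 is not in [0, 65472)` comes from a gather op (`GatherV2_2`) that was asked for element 94081 of an axis with only 65472 entries. The `tf.gather` documentation states that an out-of-bound index raises an error on CPU, while on GPU a 0 is stored in the corresponding output value, which may be one reason this report shows up on CPU runs. The sketch below is plain Python, not code from the repository; the helper names and the example index list are hypothetical, and only position 1, the value 94081, and the bound 65472 are taken from the log.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for every index outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


def format_gather_error(indices, size):
    """Build a message shaped like the one in the traceback.

    Returns None when every index is valid.
    """
    bad = find_out_of_range(indices, size)
    if not bad:
        return None
    pos, idx = bad[0]  # report only the first offending index
    return f"indices[{pos}] = {idx} is not in [0, {size})"


# Hypothetical index list: only position 1 and the bound 65472 match the log.
example_indices = [10, 94081, 300]
print(format_gather_error(example_indices, 65472))
```

Running the same kind of bounds check on the indices fed to the failing gather would be one way to find where the oversized index is produced.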

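One more detail: my GPU is too weak to be usable, so I wanted to try the CPU build of TensorFlow. According to your FAQ, installing the CPU version of TF is enough to train on CPU. After setting up the environment, running train.py produced the error above.

mx12zhu • Jul 18 '22 03:07

Since you're on CPU anyway, how about trying the Keras version?

bubbliiiing • Jul 19 '22 15:07

Will do, thanks!

mx12zhu • Jul 20 '22 00:07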

Hello, I have a few more questions. If I train on CPU only, does anything in the code need to change besides installing the matching CPU build of TF? For example:

  1. This part?

    if __name__ == "__main__":
        #---------------------------------------------------------------------#
        #   train_gpu   GPUs used for training
        #               Defaults to the first card; [0, 1] for two cards, [0, 1, 2] for three
        #               With multiple GPUs, each card's batch is the total batch divided by the number of cards.
        #---------------------------------------------------------------------#
        train_gpu = [0,]

  2. Or this part?

    #------------------------------------------------------#
    #   Set the GPUs to use
    #------------------------------------------------------#
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in train_gpu)
    ngpus_per_node = len(train_gpu)

    gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

    if ngpus_per_node > 1:
        strategy = tf.distribute.MirroredStrategy()
    else:
        strategy = None
    print('Number of devices: {}'.format(ngpus_per_node))

mx12zhu • Jul 20 '22 09:07

No changes needed.

bubbliiiing • Jul 21 '22 14:07
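
To illustrate why the quoted setup code can stay as it is on a CPU-only install, here is a minimal sketch that mirrors its branching without importing TensorFlow. The function `plan_device_setup` and its return shape are hypothetical; `physical_gpus` stands in for the result of `tf.config.experimental.list_physical_devices(device_type='GPU')`, which is assumed to be an empty list when no GPU is available.

```python
def plan_device_setup(train_gpu, physical_gpus):
    """Mirror the branching of the quoted device-setup code (illustration only).

    train_gpu:      list of GPU ids, as in train.py (default [0,]).
    physical_gpus:  stand-in for the GPU list TensorFlow would report.
    """
    # Value that train.py would assign to CUDA_VISIBLE_DEVICES.
    visible_devices = ','.join(str(x) for x in train_gpu)
    ngpus_per_node = len(train_gpu)

    # train.py enables memory growth once per reported GPU;
    # with an empty list the loop body never runs.
    memory_growth_targets = list(physical_gpus)

    # A distribution strategy is only created for more than one GPU.
    strategy = "MirroredStrategy" if ngpus_per_node > 1 else None
    return visible_devices, memory_growth_targets, strategy


# CPU-only machine with the default train_gpu = [0,]
print(plan_device_setup([0,], []))
```

With the default `train_gpu = [0,]` and no GPUs reported, the memory-growth loop has nothing to iterate over and no strategy is created, so under these assumptions the GPU-specific lines have no effect.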

I'm hitting the same problem. My environment:

numpy==1.18.4
matplotlib==3.2.1
scipy==1.4.1
opencv_python==4.2.0.34
tensorflow_gpu==2.2.0
tqdm==4.46.1
Pillow==8.2.0
h5py==2.10.0

Full environment:

Package                        Version
absl-py                        1.2.0
astunparse                     1.6.3
backcall                       0.2.0
backports.functools-lru-cache  1.6.4
cachetools                     4.2.4
certifi                        2022.6.15
charset-normalizer             2.1.1
colorama                       0.4.5
cycler                         0.11.0
Cython                         0.29.32
debugpy                        1.6.3
decorator                      5.1.1
entrypoints                    0.4
gast                           0.3.3
google-auth                    1.35.0
google-auth-oauthlib           0.4.6
google-pasta                   0.2.0
grpcio                         1.47.0
h5py                           2.10.0
idna                           3.3
imageio                        2.9.0
importlib-metadata             4.12.0
ipykernel                      6.15.1
ipython                        7.33.0
jedi                           0.18.1
jupyter-client                 7.3.4
jupyter_core                   4.11.1
Keras-Preprocessing            1.1.2
kiwisolver                     1.4.4
labelme                        3.16.7
Markdown                       3.4.1
MarkupSafe                     2.1.1
matplotlib                     3.2.1
matplotlib-inline              0.1.6
nest-asyncio                   1.5.5
networkx                       2.6.3
numpy                          1.18.4
oauthlib                       3.2.0
opencv-python                  4.2.0.34
opt-einsum                     3.3.0
packaging                      21.3
parso                          0.8.3
pickleshare                    0.7.5
Pillow                         8.2.0
pip                            22.1.2
prompt-toolkit                 3.0.30
protobuf                       3.20.1
psutil                         5.9.1
pyasn1                         0.4.8
pyasn1-modules                 0.2.8
pycocotools-windows            2.0.0.1
Pygments                       2.13.0
pyparsing                      3.0.9
PyQt5                          5.15.7
PyQt5-Qt5                      5.15.2
PyQt5-sip                      12.11.0
python-dateutil                2.8.2
PyWavelets                     1.3.0
pywin32                        303
PyYAML                         6.0
pyzmq                          23.2.0
QtPy                           2.2.0
requests                       2.28.1
requests-oauthlib              1.3.1
rsa                            4.9
scikit-image                   0.16.2
scipy                          1.4.1
setuptools                     63.4.1
six                            1.16.0
tensorboard                    2.2.2
tensorboard-plugin-wit         1.8.1
tensorflow-gpu                 2.2.0
tensorflow-gpu-estimator       2.2.0
termcolor                      1.1.0
tornado                        6.2
tqdm                           4.46.1
traitlets                      5.3.0
typing_extensions              4.3.0
urllib3                        1.26.12
wcwidth                        0.2.5
Werkzeug                       2.2.2
wheel                          0.37.1
wincertstore                   0.2
wrapt                          1.14.1
zipp                           3.8.1

Dittonal • Aug 23 '22 08:08

It looks like the GPU environment may be misconfigured.

bubbliiiing • Aug 27 '22 16:08

Hello, did you ever solve this problem? I'm getting the same error, also on CPU.

daixin0609 • Feb 07 '23 06:02


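For CPU, I suggest trying the Keras version. The tf2 code itself looks fine; my guess is that the problem is in the internal backpropagation, but I don't really understand why.

bubbliiiing • Feb 09 '23 06:02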