mask-rcnn-tf2

Loss is NaN?

Open ChenMaolong opened this issue 2 years ago • 21 comments

Hi, I ran your git code with your dataset and the loss became NaN:

Epoch 00002: LearningRateScheduler reducing learning rate to 6e-06.
Epoch 2/100
202/202 [==============================] - 106s 527ms/step - loss: nan - rpn_class_loss_loss: nan - rpn_bbox_loss_loss: nan - mrcnn_class_loss_loss: 1.0970 - mrcnn_bbox_loss_loss: 0.0000e+00 - mrcnn_mask_loss_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss_loss: nan - val_rpn_bbox_loss_loss: nan - val_mrcnn_class_loss_loss: 1.0962 - val_mrcnn_bbox_loss_loss: 0.0000e+00 - val_mrcnn_mask_loss_loss: 0.0000e+00 - lr: 6.0000e-06

Is this an environment problem? I'm on TensorFlow 2.2.0, Python 3.6.13, and CUDA 10.1. Is it the environment, or something else?

ChenMaolong avatar Oct 16 '22 07:10 ChenMaolong

Which GPU?

bubbliiiing avatar Oct 16 '22 14:10 bubbliiiing

A laptop 3060 GPU.

ChenMaolong avatar Oct 16 '22 16:10 ChenMaolong

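Is the laptop just not powerful enough to run the TF version of Mask R-CNN? I set up my environment by following the tutorial on your blog, and everything matches.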

ChenMaolong avatar Oct 16 '22 16:10 ChenMaolong

The versions are probably wrong.

bubbliiiing avatar Oct 22 '22 16:10 bubbliiiing

The CUDA and cuDNN versions are wrong.

bubbliiiing avatar Oct 22 '22 16:10 bubbliiiing

Hi, I've run into the same problem. My machine has a 4090, with tensorflow-gpu=2.2.0, cuda=10.1, cudnn=7.4.5, python=3.8, all configured by following your CSDN tutorial. After installation, print(tf.test.is_gpu_available()) also returns True. However, during training, whether I use the Adam or the SGD optimizer, and no matter how small I set the learning rate (I even tried 0), the loss becomes NaN at the second step of the first epoch. (The first step of the first epoch does report a loss, around 20, which is large.) The full output is:

2023-08-09 22:12:04.204688: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
1/38 [..............................] - ETA: 0s - loss: 17.3096 - rpn_class_loss_loss: 0.7027 - rpn_bbox_loss_loss: 4.3601 - mrcnn_class_loss_loss: 0.0000e+00 - mrcnn_bbox_loss_loss: 1.7061 - mrcnn_mask_loss_loss: 7.4431
2023-08-09 22:12:04.484461: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed
2023-08-09 22:12:04.487804: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216] GpuTracer has collected 6222 callback api events and 6222 activity events.
2023-08-09 22:12:04.635549: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04
2023-08-09 22:12:04.718609: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04\DESKTOP-R6MU409.trace.json.gz
2023-08-09 22:12:04.729861: E tensorflow/core/profiler/utils/hardware_type_utils.cc:60] Invalid GPU compute capability.
2023-08-09 22:12:04.789777: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 1.205 ms
2023-08-09 22:12:04.977899: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04
Dumped tool data for overview_page.pb to logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04\DESKTOP-R6MU409.overview_page.pb
Dumped tool data for input_pipeline.pb to logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04\DESKTOP-R6MU409.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04\DESKTOP-R6MU409.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs\loss_2023_08_09_22_09_15\train\plugins\profile\2023_08_09_14_12_04\DESKTOP-R6MU409.kernel_stats.pb
2/38 [>.............................] - ETA: 14s - loss: nan - rpn_class_loss_loss: nan - rpn_bbox_loss_loss: 13.0844 - mrcnn_class_loss_loss: 132.9681 - mrcnn_bbox_loss_loss: 29.0181 - mrcnn_mask_loss_loss: 7.7721
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.281999). Check your callbacks.
3/38 [=>............................] - ETA: 40s - loss: nan - rpn_class_loss_loss: nan - rpn_bbox_loss_loss: nan - mrcnn_class_loss_loss: 89.2423 - mrcnn_bbox_loss_loss:

I've already tried many fixes I found online and none of them worked. Could you please take a look?
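To see which loss term goes NaN first in a log like the one above, the progress lines can be parsed mechanically. The following is a minimal sketch, assuming the log uses the standard Keras progress-bar layout (`step/total [` followed by `name: value` pairs); the helper names `parse_progress_line` and `first_nan_metrics` are hypothetical and not part of the repository.

```python
import math
import re

# "name: value" pairs such as "loss: 17.3096" or "rpn_class_loss_loss: nan".
# The trailing lookahead rejects values like "14s" in "ETA: 14s".
_METRIC = re.compile(
    r"(\w+): (nan|[-+]?\d+(?:\.\d+)?(?:e[-+]?\d+)?)(?![\w.])", re.IGNORECASE
)
# The "2/38 [" step counter that starts a Keras progress line.
_STEP = re.compile(r"(\d+)/(\d+) \[")


def parse_progress_line(line):
    """Return (step, {metric: value}) for a progress line, or None otherwise."""
    step = _STEP.search(line)
    if step is None:
        return None
    metrics = {name: float(value) for name, value in _METRIC.findall(line)}
    return int(step.group(1)), metrics


def first_nan_metrics(lines):
    """Return (step, [metric names]) for the first line containing a NaN metric."""
    for line in lines:
        parsed = parse_progress_line(line)
        if parsed is None:
            continue
        step, metrics = parsed
        bad = [name for name, value in metrics.items() if math.isnan(value)]
        if bad:
            return step, bad
    return None
```

Applied to the log above, this would point at step 2, where `loss` and `rpn_class_loss_loss` are NaN while the other terms are still finite.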

xieyizi990430 avatar Aug 09 '23 14:08 xieyizi990430

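[image] Hi, I looked for solutions online, and what this person describes may be my case. I do have very small, isolated regions. Could that have an effect and cause the NaN? My images are all 2048*3072. Although they contain very fine, very small regions, the annotations are very detailed and the contours fit closely.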

xieyizi990430 avatar Aug 09 '23 14:08 xieyizi990430

Use the CUDA 11 one.

bubbliiiing avatar Aug 09 '23 15:08 bubbliiiing

It's a version problem.

bubbliiiing avatar Aug 09 '23 15:08 bubbliiiing

[image] But this is what the official TF site says... Does CUDA 11 really work? Is there a version combination you'd recommend? I'll go try it first!

xieyizi990430 avatar Aug 10 '23 07:08 xieyizi990430

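Today I tried running your Mask R-CNN code with TensorFlow 2.2.0 + CUDA 11, TensorFlow 2.4.0 + CUDA 11, and TensorFlow 2.6.0 + CUDA 11. TensorFlow 2.2.0 still produces NaN, and the other two don't run at all, possibly because of version incompatibility. Research is a long road. Should I keep debugging this code, or modify it? Please advise. Having a 4090 machine that can't run this is really frustrating! [image]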

xieyizi990430 avatar Aug 10 '23 16:08 xieyizi990430

Crying!!!!

xieyizi990430 avatar Aug 13 '23 14:08 xieyizi990430

CUDA 11.0 + TensorFlow 2.4.0 works.
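One possible reading of why this pairing helps: the 3060 and 4090 have compute capability 8.x, while CUDA 10.1 only targets architectures up to 7.5, so a TensorFlow 2.2.0 / CUDA 10.1 build has no native code for these cards. The sketch below encodes that check. The version pairings follow the tested build configurations in the TensorFlow install docs and the compute capabilities follow NVIDIA's published tables; the function names are hypothetical, and the rule is an assumption about the cause, since the thread only establishes the working versions empirically.

```python
# Tested GPU build pairings as listed in the TensorFlow install docs.
TESTED_BUILDS = {
    "2.2.0": {"cuda": "10.1", "cudnn": "7.6"},
    "2.4.0": {"cuda": "11.0", "cudnn": "8.0"},
    "2.6.0": {"cuda": "11.2", "cudnn": "8.1"},
}

# Newest compute capability each CUDA toolkit can target natively.
NEWEST_TARGET = {"10.1": (7, 5), "11.0": (8, 0), "11.2": (8, 6)}

# Compute capability of the two GPUs mentioned in this thread.
GPU_CC = {"RTX 3060": (8, 6), "RTX 4090": (8, 9)}


def toolkit_covers_gpu(cuda_version, gpu_cc):
    """True if the toolkit targets the GPU's major architecture.

    Simplification (assumption): binaries built for compute capability X.y
    run on X.z with z >= y, so only the major version is compared. Support
    dropped for very old GPUs is not modeled.
    """
    newest_major, _ = NEWEST_TARGET[cuda_version]
    return gpu_cc[0] <= newest_major


def check_combo(tf_version, gpu_name):
    """Check a TensorFlow release (via its tested CUDA version) against a GPU."""
    cuda = TESTED_BUILDS[tf_version]["cuda"]
    return toolkit_covers_gpu(cuda, GPU_CC[gpu_name])


if __name__ == "__main__":
    for tf_version in TESTED_BUILDS:
        for gpu_name in GPU_CC:
            status = "ok" if check_combo(tf_version, gpu_name) else "too old for this GPU"
            print(f"TensorFlow {tf_version} on {gpu_name}: {status}")
```

Under these assumptions, TensorFlow 2.2.0 is flagged for both cards, while 2.4.0 and 2.6.0 pass, provided they are installed with their own tested CUDA version rather than a mismatched one.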

bubbliiiing avatar Aug 13 '23 14:08 bubbliiiing

You could even try a newer version 0 0

bubbliiiing avatar Aug 13 '23 14:08 bubbliiiing

You can also just try running conda install tensorflow-gpu in cmd.

bubbliiiing avatar Aug 13 '23 14:08 bubbliiiing

I'll go try it right now!!! 0 0 I hereby declare the author adorable again!

xieyizi990430 avatar Aug 13 '23 14:08 xieyizi990430

You're on Windows, right?

bubbliiiing avatar Aug 13 '23 15:08 bubbliiiing

Yes, yes! (I actually tried it before, but ran into some problems.) I'll try again now.

xieyizi990430 avatar Aug 13 '23 15:08 xieyizi990430

OK, try again. You can also try conda install tensorflow-gpu.

bubbliiiing avatar Aug 13 '23 16:08 bubbliiiing

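It works! My code runs and the bug is gone! I used tensorflow-gpu 2.4.0 with CUDA 11.0 and cuDNN 8.0. Along the way I hit one more problem: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 4294967295. After fixing that (following this article: https://dobromyslova.medium.com/making-work-tensorflow-with-nvidia-rtx-3090-on-windows-10-7a38e8e582bf), the code ran and the NaN disappeared! Maybe the cause is that the hardware is too new? I've seen many 30-series machines with these strange problems, and I didn't expect the 4090 to behave the same way. It looks like 40-series cards have to be configured like 30-series ones from now on QAQ... Anyway, thank you very much!!!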
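To confirm that an environment really matches the working combination reported here (TensorFlow 2.4.0, CUDA 11.0, cuDNN 8.0), the installed build can be inspected from Python. This is a hedged sketch: `describe_tf_environment` is a hypothetical helper, `tf.sysconfig.get_build_info()` is, to my knowledge, only available from TensorFlow 2.3 onward, and the `cuda_version` / `cudnn_version` keys may be absent on some builds, so every lookup is guarded.

```python
def describe_tf_environment():
    """Collect TensorFlow version details.

    Returns {"tensorflow": None} when TensorFlow is not importable.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow": None}

    info = {"tensorflow": tf.__version__}
    # GPUs that TensorFlow can actually see in this process.
    info["gpus"] = [device.name for device in tf.config.list_physical_devices("GPU")]

    # Assumption: get_build_info exists only on newer releases, so look it up
    # defensively instead of calling it directly.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    build = dict(get_build_info()) if get_build_info else {}
    info["cuda_build"] = build.get("cuda_version")
    info["cudnn_build"] = build.get("cudnn_version")
    return info


if __name__ == "__main__":
    for key, value in describe_tf_environment().items():
        print(f"{key}: {value}")
```

A visible GPU alone does not show that the CUDA build matches the card, which is consistent with the earlier report that the GPU check returned True while training still produced NaN.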

xieyizi990430 avatar Aug 15 '23 14:08 xieyizi990430

As long as it works.

bubbliiiing avatar Aug 17 '23 16:08 bubbliiiing