PaddleDetection
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Search before asking
Bug Component
Training
Describe the Bug
Config file contents: configs/ppyoloe_plus_crn_m_80e_coco.yml
_BASE_: [
  '/data/PaddleDetection/configs/datasets/coco_detection.yml',
  '/data/PaddleDetection/configs/runtime.yml',
  '/data/PaddleDetection/configs/ppyoloe/_base_/optimizer_80e.yml',
  '/data/PaddleDetection/configs/ppyoloe/_base_/ppyoloe_plus_crn.yml',
  '/data/PaddleDetection/configs/ppyoloe/_base_/ppyoloe_plus_reader.yml',
]
num_classes: 33
TrainDataset:
  !COCODataSet
    image_dir: train
    anno_path: annotations/train.json
    dataset_dir: /data/work/dataset
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']
EvalDataset:
  !COCODataSet
    image_dir: val
    anno_path: annotations/val.json
    dataset_dir: /data/work/dataset
TestDataset:
  !ImageFolder
    anno_path: annotations/val.json
    dataset_dir: /data/work/dataset
TrainReader:
  batch_size: 8
EvalReader:
  batch_size: 2
log_iter: 50 #100
save_dir: /data/work/output
snapshot_epoch: 5
epoch: 70 #80
LearningRate:
  base_lr: 0.0000625 #0.0000125 #0.001
weights: /data/work/output/ppyoloe_plus_crn_m_80e_coco/model_final
pretrain_weights: https://paddledet.bj.bcebos.com/models/ppyoloe_plus_crn_m_80e_coco.pdparams
depth_mult: 0.67
width_mult: 0.75
Command executed:
export CUDA_VISIBLE_DEVICES=0
python tools/train.py -c configs/ppyoloe_plus_crn_m_80e_coco.yml --amp --eval --use_vdl=true --vdl_log_dir=/data/work/option-number/logs
The error output is as follows:
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
-------
-------
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Traceback (most recent call last):
File "tools/train.py", line 172, in <module>
main()
File "tools/train.py", line 168, in main
run(FLAGS, cfg)
File "tools/train.py", line 132, in run
trainer.train(FLAGS.eval)
File "/data/PaddleDetection/ppdet/engine/trainer.py", line 485, in train
outputs = model(data)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
return self.forward(*inputs, **kwargs)
File "/data/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
out = self.get_loss()
File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
return self._forward()
File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 88, in _forward
yolo_losses = self.yolo_head(neck_feats, self.inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
return self.forward(*inputs, **kwargs)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
return self.forward_train(feats, targets)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
], targets)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 356, in get_loss
assigned_scores_sum)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 269, in _bbox_loss
if num_pos > 0:
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 680, in __bool__
return self.__nonzero__()
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 673, in __nonzero__
return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)
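The assertion quoted in the log is a range check on the input of the binary cross-entropy kernel. The following is a minimal pure-Python sketch of a check of that form, not Paddle's kernel: the names `bce_input_ok` and `bce` and the `eps` clamp are assumptions made for illustration. It shows why a NaN input always fails such a check: every ordered comparison against NaN evaluates to false.

```python
import math


def bce_input_ok(x: float) -> bool:
    """Mirror the range check quoted in the assertion: 0 <= x <= 1."""
    return (x >= 0.0) and (x <= 1.0)


def bce(x: float, label: float, eps: float = 1e-12) -> float:
    """Scalar binary cross-entropy guarded by the same precondition.

    The eps clamp is an assumption of this sketch, used only to keep
    log() finite at x == 0 or x == 1.
    """
    if not bce_input_ok(x):
        raise ValueError(f"input expected in [0, 1], got {x}")
    x = min(max(x, eps), 1.0 - eps)
    return -(label * math.log(x) + (1.0 - label) * math.log(1.0 - x))


print(bce_input_ok(0.3))           # True
print(bce_input_ok(float("nan")))  # False: NaN fails both comparisons
```

In other words, the assertion only reports that the value reaching the loss was already NaN; the NaN itself was presumably produced earlier in the forward pass.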
After setting
export FLAGS_check_nan_inf=1
the output is:
[01/16 15:08:41] ppdet.engine INFO: Epoch: [4] [100/827] learning_rate: 0.000052 loss: 1.833632 loss_cls: 0.953054 loss_iou: 0.158906 loss_dfl: 0.899485 loss_l1: 0.325665 eta: 3:05:50 batch_cost: 0.1956 data_cost: 0.0002 ips: 40.9046 images/s
[01/16 15:08:52] ppdet.engine INFO: Epoch: [4] [150/827] learning_rate: 0.000052 loss: 1.754388 loss_cls: 0.963562 loss_iou: 0.153431 loss_dfl: 0.845342 loss_l1: 0.293998 eta: 3:05:34 batch_cost: 0.1968 data_cost: 0.0002 ips: 40.6550 images/s
numel:648 idx:544 value:23.359375
numel:648 idx:545 value:-18.828125
numel:648 idx:546 value:-25.531250
numel:648 idx:27 value:-inf
numel:648 idx:28 value:-inf
numel:648 idx:351 value:-inf
In block 0, there has 0,54,594 nan,inf,num
Error: /paddle/paddle/fluid/framework/details/nan_inf_utils_detail.cu:105 Assertion `false` failed. ===ERROR: in [op=conv2d_grad] [tensor=] find nan or inf===
Traceback (most recent call last):
File "tools/train.py", line 172, in <module>
main()
File "tools/train.py", line 168, in main
run(FLAGS, cfg)
File "tools/train.py", line 132, in run
trainer.train(FLAGS.eval)
File "/data/PaddleDetection/ppdet/engine/trainer.py", line 491, in train
scaler.minimize(self.optimizer, scaled_loss)
File "/usr/local/lib/python3.7/dist-packages/paddle/amp/grad_scaler.py", line 157, in minimize
return super(GradScaler, self).minimize(optimizer, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/amp/loss_scaler.py", line 222, in minimize
self._unscale(optimizer)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/amp/loss_scaler.py", line 310, in _unscale
self._found_inf = self._temp_found_inf_fp16 or self._temp_found_inf_fp32
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 680, in __bool__
return self.__nonzero__()
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 673, in __nonzero__
return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)
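The summary line `In block 0, there has 0,54,594 nan,inf,num` appears to report 0 NaN values, 54 infinite values and 594 ordinary values, which adds up to the `numel:648` printed above it. The sketch below only reproduces that bookkeeping in plain Python; `summarize` is a hypothetical helper, not Paddle's implementation, and the sample data is synthetic.

```python
import math


def summarize(values):
    """Return (nan_count, inf_count, finite_count) for a sequence of floats."""
    nan_count = inf_count = finite_count = 0
    for v in values:
        if math.isnan(v):
            nan_count += 1
        elif math.isinf(v):
            inf_count += 1
        else:
            finite_count += 1
    return nan_count, inf_count, finite_count


# Synthetic tensor contents: 594 finite entries and 54 entries equal to -inf.
fake_grad = [23.359375] * 594 + [float("-inf")] * 54
print(len(fake_grad))        # 648
print(summarize(fake_grad))  # (0, 54, 594)
```

Note that this run was started with `--amp` (the traceback passes through `scaler.minimize`). With dynamic loss scaling, occasional infinite gradients can be an expected overflow that the scaler handles by skipping the step, so this check firing in `conv2d_grad` may not by itself locate the origin of the NaN seen in the loss.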
Environment
OS: Ubuntu 20.04
Docker image: paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4
GPU: single card, NVIDIA GeForce RTX 2080 Ti, 11 GB memory
paddlepaddle: 2.4.1
PaddleDetection: 2.5.0
Bug description confirmation
- [X] I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.
Are you willing to submit a PR?
- [X] I'd like to help by submitting a PR!
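Try training without AMP first and see what happens.
Adjusting the learning rate according to the number of GPUs and the batch size can solve this.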
Regarding the learning rate: the config I posted is already the adjusted one.
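The suggestion to adjust the learning rate by GPU count and batch size is usually read as the linear scaling rule. A minimal sketch follows, assuming a reference setup of `base_lr: 0.001` on 8 GPUs with 8 images per GPU. The `0.001` matches the value commented out in the posted config, but the 8 x 8 reference is an assumption here and should be checked against the model's README. `scale_lr` is a hypothetical helper, not part of PaddleDetection.

```python
def scale_lr(ref_lr, ref_gpus, ref_batch_per_gpu, gpus, batch_per_gpu):
    """Scale the learning rate linearly with the total batch size."""
    ref_total = ref_gpus * ref_batch_per_gpu
    total = gpus * batch_per_gpu
    return ref_lr * total / ref_total


# Assumed reference: 0.001 for 8 GPUs x 8 images per GPU (total batch 64).
# Setup from this issue: 1 GPU x 8 images per GPU (total batch 8).
lr = scale_lr(0.001, ref_gpus=8, ref_batch_per_gpu=8, gpus=1, batch_per_gpu=8)
print(lr)  # roughly 1.25e-4
```

Under that assumed reference, the posted `base_lr: 0.0000625` is half of what the rule gives, so the learning rate in the config is already on the conservative side.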
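@ghostxsl, as you suggested, I removed AMP, but it still reports the same error.
Then it is probably a bug in a Paddle framework operator. Try a different combination of paddle and python versions.
It may be a compatibility problem between the Paddle framework and different platforms; see #6723.
No luck. I switched to python3.9 and it still reports a similar error.
https://github.com/PaddlePaddle/PaddleDetection/issues/6723#issuecomment-1326083748 Try the unit test case there first and see whether a similar bug also shows up in your environment.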
I tested with that code and did not get the same output. Across multiple runs it only returns the following:
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
[0, 1, 2])
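Judging from the error messages, the two problems do not seem to originate in the same place.
Could it be related to this parameter?
The default is: pretrain_weights: https://bj.bcebos.com/v1/paddledet/models/pretrained/ppyoloe_crn_s_obj365_pretrained.pdparams
I am using: pretrain_weights: https://paddledet.bj.bcebos.com/models/ppyoloe_plus_crn_m_80e_coco.pdparams
I tried running it on AI Studio, and so far it has not reported an error.
The CUDA version on AI Studio is 11.2. My guess is that this is a compatibility problem between paddlepaddle and CUDA 11.7.
I will wait for the AI Studio run to finish and see whether it still fails. If it does not, I will try downgrading my own environment.
I tested with the same dataset and the same config parameters.
On AI Studio the training ran to completion without any problem.
AI Studio environment: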
aistudio@jupyter-2276827-4958141:~$ nvidia-smi
Wed Jan 18 09:09:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:04:00.0 Off | 0 |
| N/A 37C P0 53W / 300W | 763MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
aistudio@jupyter-2276827-4958141:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-bf77909a-5ace-6815-3a98-7b575241c3bf)
aistudio@jupyter-2276827-4958141:~$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
aistudio@jupyter-2276827-4958141:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
pip list
aistudio@jupyter-2276827-4958141:~$ pip list
Package Version
------------------------------ ---------------
absl-py 0.8.1
alembic 1.8.1
altair 4.2.0
anyio 3.6.1
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
aspy.yaml 1.3.0
astor 0.8.1
astroid 2.4.1
async-generator 1.10
attrs 22.1.0
audioread 2.1.8
autopep8 1.6.0
Babel 2.8.0
backcall 0.1.0
backports.zoneinfo 0.2.1
bce-python-sdk 0.8.53
beautifulsoup4 4.11.1
bleach 5.0.1
blinker 1.5
cachetools 4.0.0
certifi 2019.9.11
certipy 0.1.3
cffi 1.15.1
cfgv 2.0.1
chardet 3.0.4
click 8.0.4
cloudpickle 1.6.0
cma 2.7.0
colorama 0.4.4
colorlog 4.1.0
commonmark 0.9.1
cryptography 38.0.1
cycler 0.10.0
Cython 0.29
debugpy 1.6.0
decorator 4.4.2
defusedxml 0.7.1
dill 0.3.3
easydict 1.9
entrypoints 0.4
et-xmlfile 1.0.1
fastjsonschema 2.16.1
filelock 3.0.12
filterpy 1.4.5
fire 0.5.0
flake8 4.0.1
Flask 1.1.1
Flask-Babel 1.0.0
Flask-Cors 3.0.8
forbiddenfruit 0.1.3
funcsigs 1.0.2
future 0.18.0
gast 0.3.3
gitdb 4.0.5
GitPython 3.1.14
google-auth 1.10.0
google-auth-oauthlib 0.4.1
graphviz 0.13
greenlet 1.1.3
grpcio 1.35.0
gunicorn 20.0.4
gym 0.12.1
h5py 2.9.0
identify 1.4.10
idna 2.8
imageio 2.6.1
imageio-ffmpeg 0.3.0
importlib-metadata 4.2.0
importlib-resources 5.9.0
ipykernel 6.9.1
ipython 7.34.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
isort 4.3.21
itsdangerous 1.1.0
jdcal 1.4.1
jedi 0.17.2
jieba 0.42.1
Jinja2 3.0.0
joblib 0.14.1
JPype1 0.7.2
json5 0.9.5
jsonschema 4.16.0
jupyter-archive 3.2.1
jupyter_client 7.3.5
jupyter-core 4.11.1
jupyter-lsp 1.5.1
jupyter-server 1.16.0
jupyter-telemetry 0.1.0
jupyterhub 1.3.0
jupyterlab 3.4.5
jupyterlab-language-pack-zh-CN 3.4.post1
jupyterlab-pygments 0.2.2
jupyterlab-server 2.10.3
jupyterlab-widgets 3.0.3
kiwisolver 1.1.0
lap 0.4.0
lazy-object-proxy 1.4.3
librosa 0.7.2
lightgbm 3.1.1
llvmlite 0.31.0
lxml 4.9.1
Mako 1.2.2
Markdown 3.1.1
MarkupSafe 2.0.1
matplotlib 2.2.3
matplotlib-inline 0.1.6
mccabe 0.6.1
mistune 0.8.4
more-itertools 7.2.0
motmetrics 1.4.0
moviepy 1.0.1
multiprocess 0.70.11.1
nbclassic 0.3.1
nbclient 0.5.13
nbconvert 6.4.4
nbformat 5.5.0
nest-asyncio 1.5.5
netifaces 0.10.9
networkx 2.4
nltk 3.4.5
nodeenv 1.3.4
notebook 5.7.0
numba 0.48.0
numpy 1.19.5
oauthlib 3.1.0
objgraph 3.4.1
opencv-python 4.6.0.66
openpyxl 3.0.5
opt-einsum 3.3.0
packaging 21.3
paddle-bfloat 0.1.7
paddle2onnx 1.0.0
paddledet 2.5.0
paddlefsl 1.0.0
paddlehub 2.3.0
paddlenlp 2.1.1
paddlepaddle-gpu 2.3.2.post112
pamela 1.0.0
pandas 1.1.5
pandocfilters 1.5.0
parl 1.4.1
parso 0.7.1
pathlib 1.0.1
pexpect 4.7.0
pickleshare 0.7.5
Pillow 8.2.0
pip 22.1.2
pkgutil_resolve_name 1.3.10
plotly 5.8.0
pluggy 1.0.0
pre-commit 1.21.0
prettytable 0.7.2
proglog 0.1.9
prometheus-client 0.14.1
prompt-toolkit 2.0.10
protobuf 3.20.0
psutil 5.7.2
ptyprocess 0.7.0
py4j 0.10.9.2
pyarrow 10.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.7
pybboxes 0.1.1
pyclipper 1.3.0.post4
pycocotools 2.0.6
pycodestyle 2.8.0
pycparser 2.21
pycryptodome 3.9.9
pydeck 0.8.0
pydocstyle 5.0.2
pyflakes 2.4.0
pyglet 1.4.5
Pygments 2.13.0
pyhumps 3.8.0
pylint 2.5.2
Pympler 1.0.1
pynvml 8.0.4
pyOpenSSL 22.0.0
pyparsing 3.0.9
pypmml 0.9.11
pyrsistent 0.18.1
python-dateutil 2.8.2
python-json-logger 2.0.4
python-jsonrpc-server 0.3.4
python-language-server 0.33.0
python-lsp-jsonrpc 1.0.0
python-lsp-server 1.5.0
pytz 2019.3
pytz-deprecation-shim 0.1.0.post0
PyYAML 5.1.2
pyzmq 23.2.1
rarfile 3.1
recordio 0.1.7
requests 2.24.0
requests-oauthlib 1.3.0
resampy 0.2.2
rich 12.6.0
rope 0.17.0
rsa 4.0
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
sahi 0.10.1
scikit-learn 0.24.2
scipy 1.6.3
seaborn 0.10.0
semver 2.13.0
Send2Trash 1.8.0
sentencepiece 0.1.96
seqeval 1.2.2
setuptools 56.2.0
shapely 2.0.0
shellcheck-py 0.7.1.1
simplegeneric 0.8.1
six 1.16.0
sklearn 0.0
smmap 3.0.5
sniffio 1.3.0
snowballstemmer 2.0.0
SoundFile 0.10.3.post1
soupsieve 2.3.2.post1
SQLAlchemy 1.4.41
streamlit 1.13.0
streamlit-image-comparison 0.0.3
tabulate 0.8.3
tb-nightly 1.15.0a20190801
tb-paddle 0.3.6
tenacity 8.0.1
tensorboard 2.1.0
tensorboardX 1.8
termcolor 1.1.0
terminado 0.15.0
terminaltables 3.1.10
testpath 0.4.2
threadpoolctl 2.1.0
tinycss2 1.1.1
toml 0.10.0
toolz 0.12.0
tornado 6.2
tqdm 4.64.1
traitlets 5.4.0
typed-ast 1.4.1
typeguard 3.0.0b2
typing_extensions 4.3.0
tzdata 2022.7
tzlocal 4.2
ujson 1.35
urllib3 1.25.6
validators 0.20.0
virtualenv 16.7.9
visualdl 2.4.0
watchdog 2.2.0
wcwidth 0.1.7
webencodings 0.5.1
websocket-client 1.4.1
Werkzeug 0.16.0
whatthepatch 1.0.2
wheel 0.36.2
widgetsnbextension 3.5.2
wrapt 1.12.1
xarray 0.16.2
xgboost 1.3.3
xlrd 1.2.0
xmltodict 0.13.0
yapf 0.26.0
zipp 3.8.1
[notice] A new release of pip available: 22.1.2 -> 22.3.1
[notice] To update, run: pip install --upgrade pip
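The package list above is long, and only a few entries matter for comparing the two environments. The sketch below pulls versions out of `pip list` column output; `parse_pip_list` is a hypothetical helper and the sample text is a short excerpt of the listing above.

```python
def parse_pip_list(text):
    """Map package name -> version from `pip list` column output."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only two-column rows; skip the header, the dashed rule,
        # shell prompts and pip notices.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        versions[parts[0]] = parts[1]
    return versions


sample = """\
Package Version
------------------------------ ---------------
paddledet 2.5.0
paddlepaddle-gpu 2.3.2.post112
visualdl 2.4.0
[notice] To update, run: pip install --upgrade pip
"""

pkgs = parse_pip_list(sample)
print(pkgs["paddlepaddle-gpu"])  # 2.3.2.post112
print(pkgs["paddledet"])         # 2.5.0
```

Read this way, the AI Studio environment runs paddlepaddle-gpu 2.3.2.post112, while the failing local environment was reported as paddlepaddle 2.4.1. The two setups therefore differ in framework version and GPU model as well as in CUDA version.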
I switched my local CUDA to 11.2, and it still reports the same error:
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Traceback (most recent call last):
File "tools/train.py", line 172, in <module>
main()
File "tools/train.py", line 168, in main
run(FLAGS, cfg)
File "tools/train.py", line 132, in run
trainer.train(FLAGS.eval)
File "/data/PaddleDetection/ppdet/engine/trainer.py", line 485, in train
outputs = model(data)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
return self.forward(*inputs, **kwargs)
File "/data/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
out = self.get_loss()
File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
return self._forward()
File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 88, in _forward
yolo_losses = self.yolo_head(neck_feats, self.inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
return self.forward(*inputs, **kwargs)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
return self.forward_train(feats, targets)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
], targets)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 356, in get_loss
assigned_scores_sum)
File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 269, in _bbox_loss
if num_pos > 0:
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 680, in __bool__
return self.__nonzero__()
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 673, in __nonzero__
return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)