The same code trains normally on CPU, but errors out during training on GPU
Describe the Bug
The model trains normally on CPU, but when training on GPU, NaN values appear partway through and the following error is raised:
[2022-08-04 08:38:08,797] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 000| loss: 7.798771| val_0_mse: 5.539458| val_0_mae: 1.806103| 0:00:02s
[2022-08-04 08:38:09,079] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 001| loss: 4.379002| val_0_mse: 1.134817| val_0_mae: 0.832125| 0:00:02s
[2022-08-04 08:38:09,313] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 002| loss: 4.343719| val_0_mse: 10.029250| val_0_mae: 2.479812| 0:00:03s
[2022-08-04 08:38:09,524] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 003| loss: 4.042698| val_0_mse: 0.698713| val_0_mae: 0.658151| 0:00:03s
[2022-08-04 08:38:09,743] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 004| loss: 2.243271| val_0_mse: 1.619683| val_0_mae: 0.961455| 0:00:03s
Traceback (most recent call last):
File "repro_nhits.py", line 203, in <module>
main()
File "repro_nhits.py", line 196, in main
fit_time, predict_time = run_one_model(models["nhits"])
File "repro_nhits.py", line 38, in run_one_model
model.fit(ts_train_scaled, ts_val_scaled)
File "/usr/local/lib/python3.7/dist-packages/paddlets/models/dl/paddlepaddle/paddle_base_impl.py", line 321, in fit
self._fit(train_dataloader, valid_dataloaders)
File "/usr/local/lib/python3.7/dist-packages/paddlets/models/dl/paddlepaddle/paddle_base_impl.py", line 351, in _fit
self._predict_epoch(eval_name, valid_dataloader)
File "/usr/local/lib/python3.7/dist-packages/paddlets/models/dl/paddlepaddle/paddle_base_impl.py", line 469, in _predict_epoch
metrics_logs = self._metric_container_dict[name](y_true, scores)
File "/usr/local/lib/python3.7/dist-packages/paddlets/metrics/metrics.py", line 166, in __call__
res = metric.metric_fn(y_true, y_score)
File "/usr/local/lib/python3.7/dist-packages/paddlets/metrics/utils.py", line 41, in wrapper
return func(obj, y_true, y_score)
File "/usr/local/lib/python3.7/dist-packages/paddlets/metrics/metrics.py", line 49, in metric_fn
return metrics.mean_squared_error(y_true, y_score)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_regression.py", line 439, in mean_squared_error
y_true, y_pred, multioutput
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_regression.py", line 96, in _check_reg_targets
y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 801, in check_array
_assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 117, in _assert_all_finite
type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
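For reference, the ValueError comes from sklearn's finiteness check: mean_squared_error calls check_array, which rejects any NaN or inf in its inputs. The sketch below is not from the original report; it only assumes that the "scores" array passed to the metric can be converted to a NumPy array, and it can be used to confirm whether the non-finite values are already present in the model's validation predictions before the metric is computed.

import numpy as np

def report_non_finite(name, arr):
    # Count NaN / inf entries in an array and report where the first one appears.
    arr = np.asarray(arr, dtype="float64")
    bad = ~np.isfinite(arr)
    if bad.any():
        first = tuple(np.argwhere(bad)[0])
        print(f"{name}: {int(bad.sum())} non-finite values, first at index {first}")
    else:
        print(f"{name}: all values are finite")

# Hypothetical usage inside _predict_epoch, right before the metric call:
# report_non_finite("scores", scores)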
Additional Supplementary Information
No response
Hi! We've received your issue; please be patient while we respond. We will arrange for engineers to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version information, and the error message. You can also consult the official API documentation, FAQ, historical issues, and the AI community for answers. Have a nice day!
Nothing obvious can be identified from the error message alone. Could you provide version information and a minimal reproducible example?
@yaoxuefeng6 Minimal reproducible example:
Environment:
- Docker image provided by Paddle: registry.baidubce.com/paddlepaddle/paddle:2.3.1-gpu-cuda11.2-cudnn8
- GPU model: Nvidia A30
- Download link for the data used to reproduce the issue: https://github.com/KeHuoBot/storage/raw/main/paddle_dataset_train
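For completeness, the exact Paddle build inside this image can be printed with standard Paddle version APIs (this snippet is an added convenience, not part of the original report):

import paddle

print(paddle.__version__)          # e.g. 2.3.1
print(paddle.version.cuda())       # CUDA version the wheel was built against
print(paddle.version.cudnn())      # cuDNN version
print(paddle.device.get_device())  # current device, e.g. gpu:0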
Problem description: The same input tensor, whose value range is known to be normal, is passed to the same paddle.nn.functional.interpolate call. On CPU the output is fine, but on GPU the result contains a large number of inf values, which is not expected. I would like help tracking down the cause.
Summary of the reproduction code: The problem is reproduced in the forward pass of the _NHiTSModule network (lines 149-162 of the script, i.e. the part enclosed by the "ISSUE_START" and "ISSUE_END" markers). The tensor fed into this part of forward (the theta_backcast variable) contains no inf values and its value range is entirely normal, yet after the computation runs on GPU, the output x_hat contains inf values.
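To isolate the operator from the rest of the training loop, a comparison along the following lines can be tried first (a sketch, assuming a CUDA-enabled build of Paddle; the input values are random placeholders in the same NCW layout as theta_backcast, not the real data):

import numpy as np
import paddle
import paddle.nn.functional as F

# A well-behaved random input in NCW layout, roughly shaped like the first stack's theta_backcast.
x_np = np.random.uniform(-1.0, 1.0, size=(512, 1, 6)).astype("float32")

for device in ["cpu", "gpu"]:
    paddle.set_device(device)
    x = paddle.to_tensor(x_np)
    out = F.interpolate(x, size=[72], mode="linear", data_format="NCW")
    out_np = out.numpy()
    print(device,
          "contains inf:", bool(np.isinf(out_np).any()),
          "contains nan:", bool(np.isnan(out_np).any()))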
Reproduction code:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import pickle
from typing import List, Dict, Optional, Tuple
import numpy as np
import paddle
from paddle import nn
import paddle.nn.functional as F
ACTIVATIONS = [
    "ReLU",
    "RReLU",
    "PReLU",
    "ELU",
    "Softplus",
    "Tanh",
    "SELU",
    "LeakyReLU",
    "Sigmoid",
    "GELU",
]
class _Block(nn.Layer):
    def __init__(
        self,
        in_chunk_len: int,
        out_chunk_len: int,
        in_chunk_len_flat: int,
        target_dim: int,
        known_cov_dim: int,
        observed_cov_dim: int,
        num_layers: int,
        layer_width: int,
        pooling_kernel_size: int,
        n_freq_downsample: int,
        batch_norm: bool,
        dropout: float,
        activation: str,
        MaxPool1d: bool,
    ):
        super().__init__()
        self._in_chunk_len = in_chunk_len
        self._out_chunk_len = out_chunk_len
        self._target_dim = target_dim
        self._activation = getattr(nn, activation)()
        n_theta_backcast = max(in_chunk_len // n_freq_downsample * target_dim, 1)
        n_theta_forecast = max(out_chunk_len // n_freq_downsample * target_dim, 1)
        # pooling layer
        pool1d = nn.MaxPool1D if MaxPool1d else nn.AvgPool1D
        self.pooling_layer = pool1d(
            kernel_size=pooling_kernel_size,
            stride=pooling_kernel_size,
            ceil_mode=True,
        )
        # layer widths
        in_len = int(np.ceil(in_chunk_len / pooling_kernel_size)) * (target_dim + known_cov_dim + observed_cov_dim) + \
            int(np.ceil(out_chunk_len / pooling_kernel_size)) * known_cov_dim
        layer_widths = [in_len] + [layer_width] * num_layers
        # FC layers
        layers = []
        for i in range(num_layers):
            layers.append(
                nn.Linear(
                    in_features=layer_widths[i],
                    out_features=layer_widths[i + 1],
                )
            )
            layers.append(self._activation)
            if batch_norm:
                layers.append(nn.BatchNorm1D(num_features=layer_widths[i + 1]))
            if dropout > 0:
                layers.append(nn.Dropout(p=dropout))
        self.layers = nn.Sequential(*layers)
        # Fully connected layers producing forecast/backcast expansion coefficients (waveform generator parameters).
        # The coefficients are emitted for each parameter of the likelihood for the forecast.
        self.backcast_linear_layer = nn.Linear(
            in_features=layer_width, out_features=n_theta_backcast
        )
        self.forecast_linear_layer = nn.Linear(
            in_features=layer_width, out_features=n_theta_forecast
        )
    def forward(
        self,
        backcast: paddle.Tensor,
        known_cov: paddle.Tensor,
        observed_cov: paddle.Tensor
    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
        """
        forward block.
        Args:
            backcast: past target, shape: [batch_size, in_chunk_len, target_dim]
            known_cov: known covariates, shape: [batch_size, in_chunk_len + target_length, known_cov_dim]
            observed_cov: observed covariates, shape: [batch_size, in_chunk_len, observed_cov_dim]
        Returns:
            x_hat: approximation of backcast on specific frequency, shape [batch_size, in_chunk_len, target_dim]
            y_hat: tensor containing the forward forecast of the block, shape [batch_size, out_chunk_len, target_dim]
        """
        # compose feature x
        batch_size = backcast.shape[0]
        # concat backcast, known_cov, observed_cov if any;
        past_feature = [backcast]
        future_feature = None
        if known_cov is not None:
            past_feature.append(known_cov[:, :self._in_chunk_len, :])
            future_feature = known_cov[:, self._in_chunk_len:, :].transpose(perm=[0, 2, 1])
        if observed_cov is not None:
            past_feature.append(observed_cov)
        past_feature = paddle.concat(x=past_feature, axis=2).transpose(perm=[0, 2, 1])  # (N,C,L)
        # pooling layer
        x = self.pooling_layer(past_feature).reshape([batch_size, -1])
        if future_feature is not None:
            x_ = self.pooling_layer(future_feature).reshape([batch_size, -1])
            x = paddle.concat([x, x_], axis=1)
        # fully connected layer stack
        x = self.layers(x)
        # forked linear layers producing waveform generator parameters
        theta_backcast = self.backcast_linear_layer(x)  # in_chunk_len * target_dim
        theta_forecast = self.forecast_linear_layer(x)  # out_chunk_len * target_dim
        # set the expansion coefs in last dimension for the forecasts
        theta_forecast = theta_forecast.reshape((batch_size, self._target_dim, -1))
        # set the expansion coefs in last dimension for the backcasts
        theta_backcast = theta_backcast.reshape((batch_size, self._target_dim, -1))
        # interpolate both backcast and forecast from the theta_backcast and theta_forecast
        x_hat = F.interpolate(
            theta_backcast, size=[self._in_chunk_len], mode="linear", data_format='NCW'
        )
        y_hat = F.interpolate(
            theta_forecast, size=[self._out_chunk_len], mode="linear", data_format='NCW'
        )
        ################## ISSUE_START ###########
        if paddle.isinf(x_hat).numpy().any():
            print("all inf index of x_hat:")
            # The np.where print below shows the indices of all entries equal to inf:
            print(np.where(paddle.isinf(x_hat).numpy()))
            # The max / min / abs_min printed below confirm that the input tensor's value range
            # is normal, with no extremely large or extremely small values:
            abs_theta_backcast = paddle.abs(theta_backcast)
            print(
                "max(theta_backcast) = %s, min(theta_backcast) = %s, abs_min(theta_backcast) = %s" %
                (paddle.max(theta_backcast), paddle.min(theta_backcast), paddle.min(abs_theta_backcast))
            )
            # Exit immediately once the problem is hit.
            exit(1)
        ################## ISSUE_END ###########
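        ########### SUGGESTED_CPU_CROSS_CHECK (suggested addition, not part of the original reproduction code) ###########
        # One possible way to narrow the problem down further: repeat the same interpolate
        # call on a CPU copy of theta_backcast and compare it with the GPU result, e.g.:
        #
        #     cpu_theta = paddle.to_tensor(theta_backcast.numpy(), place=paddle.CPUPlace())
        #     cpu_x_hat = F.interpolate(
        #         cpu_theta, size=[self._in_chunk_len], mode="linear", data_format='NCW'
        #     )
        #     print("cpu result contains inf:", np.isinf(cpu_x_hat.numpy()).any())
        #
        # Kept commented out so that it does not change the behaviour of the reproduction script.
        ###################################################################################################################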
        x_hat = paddle.transpose(x_hat, perm=[0, 2, 1])
        y_hat = paddle.transpose(y_hat, perm=[0, 2, 1])
        return x_hat, y_hat
class _Stack(nn.Layer):
    """
    Stack implementation of the NHiTS architecture, comprising multiple basic blocks.
    Args:
        in_chunk_len: The length of the input sequence fed to the model.
        out_chunk_len: The length of the forecast of the model.
        in_chunk_len_flat: The length of the flattened input sequence (produced by concatenating past_target, known_cov, observed_cov) fed to the model.
        num_blocks: The number of blocks making up this stack.
        num_layers: The number of fully connected layers preceding the final forking layers in each block.
        layer_width: The number of neurons that make up each fully connected layer in each block.
        target_dim: The dimension of target.
        known_cov_dim(int): The number of known covariates.
        observed_cov_dim(int): The number of observed covariates.
        pooling_kernel_sizes: The kernel size for the initial pooling layer.
        n_freq_downsample: The factor by which to downsample time at the output (before interpolating).
        batch_norm: Whether to use batch norm.
        dropout: Dropout probability.
        activation: The activation function of encoder/decoder intermediate layer.
        MaxPool1d: Whether to use MaxPool1d pooling, False uses AvgPool1d.
    """
    def __init__(
        self,
        in_chunk_len: int,
        out_chunk_len: int,
        in_chunk_len_flat: int,
        num_blocks: int,
        num_layers: int,
        layer_width: int,
        target_dim: int,
        known_cov_dim: int,
        observed_cov_dim: int,
        pooling_kernel_sizes: Tuple[int],
        n_freq_downsample: Tuple[int],
        batch_norm: bool,
        dropout: float,
        activation: str,
        MaxPool1d: bool,
    ):
        super().__init__()
        self.in_chunk_len = in_chunk_len
        self.out_chunk_len = out_chunk_len
        self._target_dim = target_dim
        # TODO: leave option to share weights across blocks?
        self._blocks_list = [
            _Block(
                in_chunk_len,
                out_chunk_len,
                in_chunk_len_flat,
                target_dim,
                known_cov_dim,
                observed_cov_dim,
                num_layers,
                layer_width,
                pooling_kernel_sizes[i],
                n_freq_downsample[i],
                batch_norm=(
                    batch_norm and i == 0
                ),  # batch norm only on first block of first stack
                dropout=dropout,
                activation=activation,
                MaxPool1d=MaxPool1d,
            )
            for i in range(num_blocks)
        ]
        self._blocks = nn.LayerList(self._blocks_list)

    def forward(
        self,
        backcast: paddle.Tensor,
        known_cov: paddle.Tensor,
        observed_cov: paddle.Tensor
    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
        """
        forward stack.
        Args:
            backcast(paddle.Tensor): past target, shape: [batch_size, in_chunk_len, target_dim].
            known_cov(paddle.Tensor): known covariates, shape: [batch_size, in_chunk_len + out_chunk_len, known_cov_dim].
            observed_cov(paddle.Tensor): observed covariates, shape: [batch_size, in_chunk_len, observed_cov_dim].
        Returns:
            stack_residual: residual tensor of backcast, shape [batch_size, in_chunk_len, target_dim].
            stack_forecast: tensor containing the forward forecast of the stack, shape [batch_size, out_chunk_len, target_dim].
        """
        # init stack_forecast as paddle.zeros
        stack_forecast = paddle.zeros(
            shape=(backcast.shape[0], self.out_chunk_len, self._target_dim),
            dtype=backcast.dtype,
        )
        for block in self._blocks_list:
            # pass input through block
            x_hat, y_hat = block(backcast, known_cov, observed_cov)
            # add block forecast to stack forecast
            stack_forecast = stack_forecast + y_hat
            # subtract backcast from input to produce residual
            backcast = backcast - x_hat
        stack_residual = backcast
        return stack_residual, stack_forecast
class _NHiTSModule(nn.Layer):
    """
    Implementation of NHiTS, covering multi-targets, known_covariates and observed_covariates.
    """
    def __init__(
        self,
        in_chunk_len: int,
        out_chunk_len: int,
        target_dim: int,
        known_cov_dim: int,
        observed_cov_dim: int,
        num_stacks: int,
        num_blocks: int,
        num_layers: int,
        layer_widths: List[int],
        pooling_kernel_sizes: Optional[Tuple[Tuple[int]]],
        n_freq_downsample: Optional[Tuple[Tuple[int]]],
        batch_norm: bool,
        dropout: float,
        activation: str,
        MaxPool1d: bool,
    ):
        super().__init__()
        self._known_cov_dim = known_cov_dim
        self._observed_cov_dim = observed_cov_dim
        self._target_dim = target_dim
        self._target_length = out_chunk_len
        input_dim = target_dim + known_cov_dim + observed_cov_dim
        self._in_chunk_len_multi = in_chunk_len * input_dim + out_chunk_len * known_cov_dim
        self._pooling_kernel_sizes, self._n_freq_downsample = self._check_pooling_downsampling(
            pooling_kernel_sizes,
            n_freq_downsample,
            in_chunk_len,
            out_chunk_len,
            num_blocks,
            num_stacks
        )
        self._stacks_list = [
            _Stack(
                in_chunk_len,
                out_chunk_len,
                self._in_chunk_len_multi,
                num_blocks,
                num_layers,
                layer_widths[i],
                target_dim,
                known_cov_dim,
                observed_cov_dim,
                self._pooling_kernel_sizes[i],
                self._n_freq_downsample[i],
                batch_norm=(batch_norm and i == 0),  # batch norm only on the first block of the first stack
                dropout=dropout,
                activation=activation,
                MaxPool1d=MaxPool1d,
            )
            for i in range(num_stacks)
        ]
        self._stacks = nn.LayerList(self._stacks_list)
        self._stacks_list[-1]._blocks[-1].backcast_linear_layer.stop_gradient = True

    def _check_pooling_downsampling(
        self,
        pooling_kernel_sizes: Optional[Tuple[Tuple[int]]],
        n_freq_downsample: Optional[Tuple[Tuple[int]]],
        in_len: int,
        out_len: int,
        num_blocks: int,
        num_stacks: int
    ):
        def _check_sizes(tup, name):
            pass

        if pooling_kernel_sizes is None:
            # make stacks handle different frequencies
            # go from in_len/2 to 1 in num_stacks steps:
            max_v = max(in_len // 2, 1)
            pooling_kernel_sizes = tuple(
                (max(int(v), 1),) * num_blocks
                for v in max_v // np.geomspace(1, max_v, num_stacks)
            )
        else:
            # check provided pooling format
            _check_sizes(pooling_kernel_sizes, "`pooling_kernel_sizes`")
        if n_freq_downsample is None:
            # go from out_len/2 to 1 in num_stacks steps:
            max_v = max(out_len // 2, 1)
            n_freq_downsample = tuple(
                (max(int(v), 1),) * num_blocks
                for v in max_v // np.geomspace(1, max_v, num_stacks)
            )
        else:
            # check provided downsample format
            _check_sizes(n_freq_downsample, "`n_freq_downsample`")
        return pooling_kernel_sizes, n_freq_downsample
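    # Illustrative note (added here, based on the defaults computed above and the settings used in
    # main() below, i.e. in_chunk_len=72, out_chunk_len=24, num_stacks=3, num_blocks=3):
    #   pooling_kernel_sizes defaults to ((36, 36, 36), (6, 6, 6), (1, 1, 1))
    #   n_freq_downsample    defaults to ((12, 12, 12), (3, 3, 3), (1, 1, 1))
    # With target_dim=1 this gives backcast theta widths of 72//12=6, 72//3=24 and 72//1=72 across the
    # three stacks, so F.interpolate upsamples widths 6/24/72 to 72 (backcast) and 2/8/24 to 24 (forecast).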
    def forward(
        self,
        data: Dict[str, paddle.Tensor]
    ) -> paddle.Tensor:
        backcast = data["past_target"]
        known_cov = data["known_cov"] if self._known_cov_dim > 0 else None
        observed_cov = data["observed_cov"] if self._observed_cov_dim > 0 else None
        # init forecast tensor
        forecast = paddle.zeros(
            shape=(backcast.shape[0], self._target_length, self._target_dim))
        for stack_index, stack in enumerate(self._stacks_list):
            # compute stack output
            stack_residual, stack_forecast = stack(backcast, known_cov, observed_cov)
            # accumulate stack_forecast to final output
            forecast = forecast + stack_forecast
            # set current stack residual as input for next stack
            backcast = stack_residual
        return forecast
def main():
    batch_size = 512
    max_epochs = 10
    train_dataset_filename = "paddle_dataset_train"
    with open(train_dataset_filename, "rb") as f:
        train_dataset = pickle.load(f)
    train_dataloader = paddle.io.DataLoader(dataset=train_dataset, batch_size=batch_size)
    # repro the issue here.
    feature_day_num = 3
    network = _NHiTSModule(
        in_chunk_len=feature_day_num * 24,
        out_chunk_len=24,
        target_dim=1,
        known_cov_dim=0,
        observed_cov_dim=11,
        num_stacks=3,
        num_blocks=3,
        num_layers=2,
        layer_widths=[512, 512, 512],
        pooling_kernel_sizes=None,
        n_freq_downsample=None,
        batch_norm=False,
        dropout=0.1,
        activation="ReLU",
        MaxPool1d=True
    )
    optimizer = paddle.optimizer.Adam(learning_rate=1e-4, parameters=network.parameters())
    for epoch_idx in range(max_epochs):
        network.train()
        for batch_idx, curr_batch in enumerate(train_dataloader):
            x_train_batch = curr_batch
            for k in x_train_batch:
                x_train_batch[k] = x_train_batch[k].astype("float32")
            y_train_batch = x_train_batch.pop("future_target")
            output = network(x_train_batch)
            loss = F.mse_loss(output, y_train_batch)
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()


if __name__ == "__main__":
    main()
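As a usage note (an addition, not part of the original report): to compare the two devices with this script, the device can be forced before main() runs by adding one of the following lines near the top of the file; "gpu:0" requires a CUDA-enabled build of Paddle.

import paddle

paddle.set_device("cpu")      # per the report, training completes without inf/nan on CPU
# paddle.set_device("gpu:0")  # per the report, reproduces the inf values in x_hat on an A30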
Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or a follow-up question arises, please reopen it at any time and we will continue to follow up.