
The same code trains normally on CPU, but errors out during training on GPU

Open bianchuanxin opened this issue 3 years ago • 3 comments

Describe the Bug

The model trains normally on CPU, but during GPU training NaN values appear partway through and the run errors out:

[2022-08-04 08:38:08,797] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 000| loss: 7.798771| val_0_mse: 5.539458| val_0_mae: 1.806103| 0:00:02s
[2022-08-04 08:38:09,079] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 001| loss: 4.379002| val_0_mse: 1.134817| val_0_mae: 0.832125| 0:00:02s
[2022-08-04 08:38:09,313] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 002| loss: 4.343719| val_0_mse: 10.029250| val_0_mae: 2.479812| 0:00:03s
[2022-08-04 08:38:09,524] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 003| loss: 4.042698| val_0_mse: 0.698713| val_0_mae: 0.658151| 0:00:03s
[2022-08-04 08:38:09,743] [paddlets.models.dl.paddlepaddle.callbacks.callbacks] [INFO] epoch 004| loss: 2.243271| val_0_mse: 1.619683| val_0_mae: 0.961455| 0:00:03s
Traceback (most recent call last):
  File "repro_nhits.py", line 203, in <module>
    main()
  File "repro_nhits.py", line 196, in main
    fit_time, predict_time = run_one_model(models["nhits"])
  File "repro_nhits.py", line 38, in run_one_model
    model.fit(ts_train_scaled, ts_val_scaled)
  File "/usr/local/lib/python3.7/dist-packages/paddlets/models/dl/paddlepaddle/paddle_base_impl.py", line 321, in fit
    self._fit(train_dataloader, valid_dataloaders)
  File "/usr/local/lib/python3.7/dist-packages/paddlets/models/dl/paddlepaddle/paddle_base_impl.py", line 351, in _fit
    self._predict_epoch(eval_name, valid_dataloader)
  File "/usr/local/lib/python3.7/dist-packages/paddlets/models/dl/paddlepaddle/paddle_base_impl.py", line 469, in _predict_epoch
    metrics_logs = self._metric_container_dict[name](y_true, scores)
  File "/usr/local/lib/python3.7/dist-packages/paddlets/metrics/metrics.py", line 166, in __call__
    res = metric.metric_fn(y_true, y_score)
  File "/usr/local/lib/python3.7/dist-packages/paddlets/metrics/utils.py", line 41, in wrapper
    return func(obj, y_true, y_score)
  File "/usr/local/lib/python3.7/dist-packages/paddlets/metrics/metrics.py", line 49, in metric_fn
    return metrics.mean_squared_error(y_true, y_score)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_regression.py", line 439, in mean_squared_error
    y_true, y_pred, multioutput
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_regression.py", line 96, in _check_reg_targets
    y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 801, in check_array
    _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 117, in _assert_all_finite
    type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
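
For context, scikit-learn validates metric inputs and rejects any non-finite value, so a single NaN or inf in the GPU-side predictions is enough to abort the validation metric with this ValueError. A minimal illustration (not part of the original report):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0], dtype="float32")
y_pred = np.array([1.0, np.inf, 3.0], dtype="float32")  # one non-finite prediction is enough
mean_squared_error(y_true, y_pred)  # raises ValueError: Input contains NaN, infinity or a value too large ...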

Additional Supplementary Information

No response

bianchuanxin avatar Aug 04 '22 09:08 bianchuanxin

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version information, and the error message. You may also look through the official API documentation, the FAQ, historical GitHub issues, and the AI community to find an answer. Have a nice day!

paddle-bot[bot] avatar Aug 04 '22 09:08 paddle-bot[bot]

The error message alone doesn't reveal the problem. Could you provide version information and a minimal reproducible example?

yaoxuefeng6 avatar Aug 05 '22 02:08 yaoxuefeng6

@yaoxuefeng6 Minimal reproducible example:

Environment:

  • Paddle-provided Docker image: registry.baidubce.com/paddlepaddle/paddle:2.3.1-gpu-cuda11.2-cudnn8

  • GPU model: Nvidia A30

  • Download link for the data used to reproduce the issue: https://github.com/KeHuoBot/storage/raw/main/paddle_dataset_train

Problem description: the same input tensor, whose value range is known to be normal, is passed to the same paddle.nn.functional.interpolate call. On CPU the output is normal, but on GPU the result contains a large number of inf values, which is unexpected. Please help investigate the cause.

Summary of the reproduction code: the issue reproduces in the forward function of the _Block layer (the section surrounded by the "ISSUE_START" and "ISSUE_END" comments). The tensor fed into the interpolate call (the theta_backcast variable) contains no inf values and its value range is entirely normal, yet after the GPU computation the output x_hat contains inf values.
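
As a minimal sketch that isolates just the suspect call (illustrative only: random data and hypothetical shapes, not the exact tensors from the training run, so it may or may not trigger the inf values by itself):

# Hedged sketch: run the same interpolate call on CPU and on GPU and check for inf.
# Shapes are hypothetical (batch=512, channels=1, length 36 upsampled to 72).
import numpy as np
import paddle
import paddle.nn.functional as F

theta = np.random.randn(512, 1, 36).astype("float32")  # finite, well-ranged input
devices = ["cpu", "gpu:0"] if paddle.is_compiled_with_cuda() else ["cpu"]
for device in devices:
    paddle.set_device(device)
    x = paddle.to_tensor(theta)
    x_hat = F.interpolate(x, size=[72], mode="linear", data_format="NCW")
    print(device, "contains inf:", bool(paddle.isinf(x_hat).numpy().any()))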

Code:

# !/usr/bin/env python3
# -*- coding: UTF-8 -*-

import pickle
from typing import List, Dict, Optional, Tuple
import numpy as np
import paddle
from paddle import nn
import paddle.nn.functional as F


ACTIVATIONS = [
    "ReLU",
    "RReLU",
    "PReLU",
    "ELU",
    "Softplus",
    "Tanh",
    "SELU",
    "LeakyReLU",
    "Sigmoid",
    "GELU",
]


class _Block(nn.Layer):
    def __init__(
            self,
            in_chunk_len: int,
            out_chunk_len: int,
            in_chunk_len_flat: int,
            target_dim: int,
            known_cov_dim: int,
            observed_cov_dim: int,
            num_layers: int,
            layer_width: int,
            pooling_kernel_size: int,
            n_freq_downsample: int,
            batch_norm: bool,
            dropout: float,
            activation: str,
            MaxPool1d: bool,
    ):
        super().__init__()
        self._in_chunk_len = in_chunk_len
        self._out_chunk_len = out_chunk_len
        self._target_dim = target_dim

        self._activation = getattr(nn, activation)()
        n_theta_backcast = max(in_chunk_len // n_freq_downsample * target_dim, 1)
        n_theta_forecast = max(out_chunk_len // n_freq_downsample * target_dim, 1)

        # pooling layer
        pool1d = nn.MaxPool1D if MaxPool1d else nn.AvgPool1D

        self.pooling_layer = pool1d(
            kernel_size=pooling_kernel_size,
            stride=pooling_kernel_size,
            ceil_mode=True,
        )
        # layer widths
        in_len = int(np.ceil(in_chunk_len / pooling_kernel_size)) * (target_dim + known_cov_dim + observed_cov_dim) + \
                 int(np.ceil(out_chunk_len / pooling_kernel_size)) * known_cov_dim

        layer_widths = [in_len] + [layer_width] * num_layers
        # FC layers
        layers = []
        for i in range(num_layers):
            layers.append(
                nn.Linear(
                    in_features=layer_widths[i],
                    out_features=layer_widths[i + 1],
                )
            )
            layers.append(self._activation)

            if batch_norm:
                layers.append(nn.BatchNorm1D(num_features=layer_widths[i + 1]))
            if dropout > 0:
                layers.append(nn.Dropout(p=dropout))
        self.layers = nn.Sequential(*layers)

        # Fully connected layer producing forecast/backcast expansion coefficients (waveform generator parameters).
        # The coefficients are emitted for each parameter of the likelihood for the forecast.
        self.backcast_linear_layer = nn.Linear(
            in_features=layer_width, out_features=n_theta_backcast
        )
        self.forecast_linear_layer = nn.Linear(
            in_features=layer_width, out_features=n_theta_forecast
        )

    def forward(
            self,
            backcast: paddle.Tensor,
            known_cov: paddle.Tensor,
            observed_cov: paddle.Tensor
    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
        """
        forward block.

        Args:
            backcast: past target, shape: [batch_size, in_chunk_len, target_dim]
            known_cov: known covariates, shape: [batch_size, in_chunk_len + target_length, known_cov_dim]
            observed_cov: observed covariates, shape: [batch_size, in_chunk_len, observed_cov_dim]

        Returns:
            x_hat: approximation of backcast on specific frequency, shape [batch_size, in_chunk_len, target_dim]
            y_hat: tensor containing the forward forecast of the block, shape [batch_size, out_chunk_len, target_dim]
        """
        # compose feature x
        batch_size = backcast.shape[0]
        # concat backcast, known_cov, observed_cov if any;
        past_feature = [backcast]
        future_feature = None
        if known_cov is not None:
            past_feature.append(known_cov[:, :self._in_chunk_len, :])
            future_feature = known_cov[:, self._in_chunk_len:, :].transpose(perm=[0, 2, 1])
        if observed_cov is not None:
            past_feature.append(observed_cov)
        past_feature = paddle.concat(x=past_feature, axis=2).transpose(perm=[0, 2, 1])  # (N,C,L)
        # pooling layer
        x = self.pooling_layer(past_feature).reshape([batch_size, -1])
        if future_feature is not None:
            x_ = self.pooling_layer(future_feature).reshape([batch_size, -1])
            x = paddle.concat([x, x_], axis=1)

        # fully connected layer stack
        x = self.layers(x)

        # forked linear layers producing waveform generator parameters
        theta_backcast = self.backcast_linear_layer(x)  # in_chunk_len * target_dim
        theta_forecast = self.forecast_linear_layer(x)  # out_chunk_len * target_dim

        # set the expansion coefs in last dimension for the forecasts
        theta_forecast = theta_forecast.reshape((batch_size, self._target_dim, -1))

        # set the expansion coefs in last dimension for the backcasts
        theta_backcast = theta_backcast.reshape((batch_size, self._target_dim, -1))

        # interpolate both backcast and forecast from the theta_backcast and theta_forecast
        x_hat = F.interpolate(
            theta_backcast, size=[self._in_chunk_len], mode="linear", data_format='NCW'
        )
        y_hat = F.interpolate(
            theta_forecast, size=[self._out_chunk_len], mode="linear", data_format='NCW'
        )

        ################## ISSUE_START ###########
        if paddle.isinf(x_hat).numpy().any():
            print("all inf index of x_hat:")
            # the np.where print below shows the indices of every element equal to inf:
            print(np.where(paddle.isinf(x_hat).numpy()))

            # the max / min / abs_min prints below confirm that the value range of the input tensor is normal, with no extremely large or small values:
            abs_theta_backcast = paddle.abs(theta_backcast)
            print(
                "max(theta_backcast) = %s, min(theta_backcast) = %s, abs_min(theta_backcast) = %s" %
                (paddle.max(theta_backcast), paddle.min(theta_backcast), paddle.min(abs_theta_backcast))
            )

            # exit the program immediately once the issue is hit.
            exit(1)
        ################## ISSUE_END ###########

        x_hat = paddle.transpose(x_hat, perm=[0, 2, 1])
        y_hat = paddle.transpose(y_hat, perm=[0, 2, 1])
        return x_hat, y_hat


class _Stack(nn.Layer):
    """
    Stack implementation of the NHiTS architecture, comprises multiple basic blocks.

    Args:
        in_chunk_len: The length of input sequence fed to the model.
        out_chunk_len: The length of the forecast of the model.
        in_chunk_len_flat: The length of the flattened input sequence (produced by concatenating past_target, known_cov, observed_cov) fed to the model.
        num_blocks: The number of blocks making up this stack.
        num_layers: The number of fully connected layers preceding the final forking layers in each block.
        layer_width: The number of neurons that make up each fully connected layer in each block.
        target_dim: The dimension of target.
        known_cov_dim(int): The number of known covariates.
        observed_cov_dim(int): The number of observed covariates.
        pooling_kernel_size: The kernel size for the initial pooling layer.
        n_freq_downsample: The factor by which to downsample time at the output (before interpolating).
        batch_norm: Whether to use batch norm.
        dropout: Dropout probability.
        activation: The activation function of encoder/decoder intermediate layer.
        MaxPool1d: Whether to use MaxPool1d pooling, False uses AvgPool1d.
    """

    def __init__(
            self,
            in_chunk_len: int,
            out_chunk_len: int,
            in_chunk_len_flat: int,
            num_blocks: int,
            num_layers: int,
            layer_width: int,
            target_dim: int,
            known_cov_dim: int,
            observed_cov_dim: int,
            pooling_kernel_sizes: Tuple[int],
            n_freq_downsample: Tuple[int],
            batch_norm: bool,
            dropout: float,
            activation: str,
            MaxPool1d: bool,
    ):
        super().__init__()
        self.in_chunk_len = in_chunk_len
        self.out_chunk_len = out_chunk_len
        self._target_dim = target_dim

        # TODO: leave option to share weights across blocks?
        self._blocks_list = [
            _Block(
                in_chunk_len,
                out_chunk_len,
                in_chunk_len_flat,
                target_dim,
                known_cov_dim,
                observed_cov_dim,
                num_layers,
                layer_width,
                pooling_kernel_sizes[i],
                n_freq_downsample[i],
                batch_norm=(
                        batch_norm and i == 0
                ),  # batch norm only on first block of first stack
                dropout=dropout,
                activation=activation,
                MaxPool1d=MaxPool1d,
            )
            for i in range(num_blocks)
        ]
        self._blocks = nn.LayerList(self._blocks_list)

    def forward(
            self,
            backcast: paddle.Tensor,
            known_cov: paddle.Tensor,
            observed_cov: paddle.Tensor
    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
        """
        forward stack.

        Args:
            backcast(paddle.Tensor): past target, shape: [batch_size, in_chunk_len, target_dim].
            known_cov(paddle.Tensor): known covariates, shape: [batch_size, in_chunk_len + out_chunk_len, known_cov_dim].
            observed_cov(paddle.Tensor): observed covariates, shape: [batch_size, in_chunk_len, observed_cov_dim].

        Returns:
            stack_residual: residual tensor of backcast, shape [batch_size, in_chunk_len, target_dim].
            stack_forecast: tensor containing the forward forecast of the stack, shape [batch_size, out_chunk_len, target_dim].
        """
        # init stack_forecast as paddle.zeros
        stack_forecast = paddle.zeros(
            shape=(backcast.shape[0], self.out_chunk_len, self._target_dim),
            dtype=backcast.dtype,
        )
        for block in self._blocks_list:
            # pass input through block
            x_hat, y_hat = block(backcast, known_cov, observed_cov)
            # add block forecast to stack forecast
            stack_forecast = stack_forecast + y_hat
            # subtract backcast from input to produce residual

            backcast = backcast - x_hat
        stack_residual = backcast
        return stack_residual, stack_forecast


class _NHiTSModule(nn.Layer):
    """
    Implementation of NHiTS, cover multi-targets, known_covariates, observed_covariates.
    """
    def __init__(
        self,
        in_chunk_len: int,
        out_chunk_len: int,
        target_dim: int,
        known_cov_dim: int,
        observed_cov_dim: int,
        num_stacks: int,
        num_blocks: int,
        num_layers: int,
        layer_widths: List[int],
        pooling_kernel_sizes: Optional[Tuple[Tuple[int]]],
        n_freq_downsample: Optional[Tuple[Tuple[int]]],
        batch_norm: bool,
        dropout: float,
        activation: str,
        MaxPool1d: bool,
    ):
        super().__init__()
        self._known_cov_dim = known_cov_dim
        self._observed_cov_dim = observed_cov_dim
        self._target_dim = target_dim
        self._target_length = out_chunk_len
        input_dim = target_dim + known_cov_dim + observed_cov_dim
        self._in_chunk_len_multi = in_chunk_len * input_dim + out_chunk_len * known_cov_dim
        self._pooling_kernel_sizes, self._n_freq_downsample = self._check_pooling_downsampling(
            pooling_kernel_sizes,
            n_freq_downsample,
            in_chunk_len,
            out_chunk_len,
            num_blocks,
            num_stacks
        )
        self._stacks_list = [
            _Stack(
                in_chunk_len,
                out_chunk_len,
                self._in_chunk_len_multi,
                num_blocks,
                num_layers,
                layer_widths[i],
                target_dim,
                known_cov_dim,
                observed_cov_dim,
                self._pooling_kernel_sizes[i],
                self._n_freq_downsample[i],
                batch_norm=(batch_norm and i == 0),  # batch norm only on the first block of the first stack
                dropout=dropout,
                activation=activation,
                MaxPool1d=MaxPool1d,
            )
            for i in range(num_stacks)
        ]

        self._stacks = nn.LayerList(self._stacks_list)
        self._stacks_list[-1]._blocks[-1].backcast_linear_layer.stop_gradient = True

    def _check_pooling_downsampling(
            self,
            pooling_kernel_sizes: Optional[Tuple[Tuple[int]]],
            n_freq_downsample: Optional[Tuple[Tuple[int]]],
            in_len: int,
            out_len: int,
            num_blocks: int,
            num_stacks: int
    ):

        def _check_sizes(tup, name):
            pass

        if pooling_kernel_sizes is None:
            # make stacks handle different frequencies
            # go from in_len/2 to 1 in num_stacks steps:
            max_v = max(in_len // 2, 1)
            pooling_kernel_sizes = tuple(
                (max(int(v), 1),) * num_blocks
                for v in max_v // np.geomspace(1, max_v, num_stacks)
            )
        else:
            # check provided pooling format
            _check_sizes(pooling_kernel_sizes, "`pooling_kernel_sizes`")

        if n_freq_downsample is None:
            # go from out_len/2 to 1 in num_stacks steps:
            max_v = max(out_len // 2, 1)
            n_freq_downsample = tuple(
                (max(int(v), 1),) * num_blocks
                for v in max_v // np.geomspace(1, max_v, num_stacks)
            )
        else:
            # check provided downsample format
            _check_sizes(n_freq_downsample, "`n_freq_downsample`")
        return pooling_kernel_sizes, n_freq_downsample

    def forward(
            self,
            data: Dict[str, paddle.Tensor]
    ) -> paddle.Tensor:
        backcast = data["past_target"]
        known_cov = data["known_cov"] if self._known_cov_dim > 0 else None
        observed_cov = data["observed_cov"] if self._observed_cov_dim > 0 else None
        # init forecast tensor
        forecast = paddle.zeros(
            shape=(backcast.shape[0], self._target_length, self._target_dim))
        for stack_index, stack in enumerate(self._stacks_list):
            # compute stack output
            stack_residual, stack_forecast = stack(backcast, known_cov, observed_cov)
            # accumulate stack_forecast to final output
            forecast = forecast + stack_forecast
            # set current stack residual as input for next stack
            backcast = stack_residual

        return forecast


def main():
    batch_size = 512
    max_epochs = 10

    train_dataset_filename = "paddle_dataset_train"
    with open(train_dataset_filename, "rb") as f:
        train_dataset = pickle.load(f)
    train_dataloader = paddle.io.DataLoader(dataset=train_dataset, batch_size=batch_size)

    # repro the issue here.
    feature_day_num = 3
    network = _NHiTSModule(
        in_chunk_len=feature_day_num * 24,
        out_chunk_len=24,
        target_dim=1,
        known_cov_dim=0,
        observed_cov_dim=11,
        num_stacks=3,
        num_blocks=3,
        num_layers=2,
        layer_widths=[512, 512, 512],
        pooling_kernel_sizes=None,
        n_freq_downsample=None,
        batch_norm=False,
        dropout=0.1,
        activation="ReLU",
        MaxPool1d=True
    )

    optimizer = paddle.optimizer.Adam(learning_rate=1e-4, parameters=network.parameters())
    for epoch_idx in range(max_epochs):
        network.train()
        for batch_idx, curr_batch in enumerate(train_dataloader):
            x_train_batch = curr_batch
            for k in x_train_batch:
                x_train_batch[k] = x_train_batch[k].astype("float32")
            y_train_batch = x_train_batch.pop("future_target")
            output = network(x_train_batch)
            loss = F.mse_loss(output, y_train_batch)
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()


if __name__ == "__main__":
    main()
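
A hedged usage note (not part of the original script): to compare the CPU and GPU behavior with the same code, the device can be pinned explicitly before the network is constructed, for example:

# hypothetical addition at the top of main(); not in the original script
paddle.set_device("cpu")      # the working CPU path
# paddle.set_device("gpu:0")  # the failing GPU path reported above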

kehuo avatar Aug 08 '22 07:08 kehuo

Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.

paddle-bot[bot] avatar Aug 15 '23 06:08 paddle-bot[bot]