GTS
Questions about the metrics.
Hello!
I noticed that the code seems to calculate the metrics (MAE, MAPE, RMSE) for each mini-batch and then average them during the test phase. However, because the null values are distributed unevenly across the test data, this approach gives different results from the canonical evaluation used in previous work (e.g. DCRNN and Graph WaveNet), even though the dataset is padded.
I ran both evaluation methods in the same test pass and got the following results:
```
2022-01-21 20:23:27,697 - INFO - Epoch [44/200] (16875) train_mae: 2.5245, val_mae: 3.5958
--------------------------(minibatch)
2022-01-21 20:23:41,579 - INFO - Horizon 15mins: mae: 2.6843, mape: 0.0678, rmse: 5.2968
2022-01-21 20:23:41,579 - INFO - Horizon 30mins: mae: 3.1403, mape: 0.0830, rmse: 6.4677
2022-01-21 20:23:41,579 - INFO - Horizon 60mins: mae: 3.6888, mape: 0.1019, rmse: 7.7696
--------------------------(fullbatch)
2022-01-21 20:23:55,685 - INFO - Horizon 15mins: mae: 2.7858, mape: 0.0704, rmse: 5.3782
2022-01-21 20:23:55,686 - INFO - Horizon 30mins: mae: 3.2629, mape: 0.0861, rmse: 6.5821
2022-01-21 20:23:55,686 - INFO - Horizon 60mins: mae: 3.8445, mape: 0.1061, rmse: 7.9454
```
The full-batch evaluation method I used is as follows:
```python
def evaluate_new(self, label, dataset='test', batches_seen=0, gumbel_soft=True):
    """
    Computes the masked metrics over the whole test set at once
    (instead of averaging per-minibatch values).
    :return: mae, mape, rmse (or mae only, in the regularized case)
    """
    with torch.no_grad():
        self.GTS_model = self.GTS_model.eval()
        val_iterator = self._data['{}_loader'.format(dataset)].get_iterator()
        losses = []
        temp = self.temperature
        y_pred_list = []
        y_true_list = []

        for batch_idx, (x, y) in enumerate(val_iterator):
            x, y = self._prepare_data(x, y)
            output, mid_output = self.GTS_model(label, x, self._train_feas, temp, gumbel_soft)

            if label == 'without_regularization':
                loss = self._compute_loss(y, output)
                y_true = self.standard_scaler.inverse_transform(y)
                y_pred = self.standard_scaler.inverse_transform(output)
                y_pred_list.append(y_pred)
                y_true_list.append(y_true)
                losses.append(loss.item())
            else:
                loss_1 = self._compute_loss(y, output)
                pred = torch.sigmoid(mid_output.view(mid_output.shape[0] * mid_output.shape[1]))
                true_label = self.adj_mx.view(mid_output.shape[0] * mid_output.shape[1]).to(device)
                compute_loss = torch.nn.BCELoss()
                loss_g = compute_loss(pred, true_label)
                loss = loss_1 + loss_g
                # option
                # loss = loss_1 + 10*loss_g
                losses.append(loss_1.item() + loss_g.item())
                y_true = self.standard_scaler.inverse_transform(y)
                y_pred = self.standard_scaler.inverse_transform(output)
                y_pred_list.append(y_pred)
                y_true_list.append(y_true)

        # Concatenate all batches along the batch dimension
        # (tensors are shaped [horizon, batch_size, num_nodes * output_dim]).
        y_pred_full = torch.cat(y_pred_list, dim=1)
        y_true_full = torch.cat(y_true_list, dim=1)
        mae = masked_mae_loss(y_pred_full, y_true_full).item()
        mape = masked_mape_loss(y_pred_full, y_true_full).item()
        rmse = torch.sqrt(masked_mse_loss(y_pred_full, y_true_full)).item()

        # Followed the DCRNN TensorFlow implementation:
        # horizon indices 2, 5, 11 correspond to 15/30/60 minutes at 5-minute steps.
        l_3 = masked_mae_loss(y_pred_full[2:3], y_true_full[2:3]).item()
        m_3 = masked_mape_loss(y_pred_full[2:3], y_true_full[2:3]).item()
        r_3 = masked_mse_loss(y_pred_full[2:3], y_true_full[2:3]).item()
        l_6 = masked_mae_loss(y_pred_full[5:6], y_true_full[5:6]).item()
        m_6 = masked_mape_loss(y_pred_full[5:6], y_true_full[5:6]).item()
        r_6 = masked_mse_loss(y_pred_full[5:6], y_true_full[5:6]).item()
        l_12 = masked_mae_loss(y_pred_full[11:12], y_true_full[11:12]).item()
        m_12 = masked_mape_loss(y_pred_full[11:12], y_true_full[11:12]).item()
        r_12 = masked_mse_loss(y_pred_full[11:12], y_true_full[11:12]).item()

        if dataset == 'test':
            message = 'Horizon 15mins: mae: {:.4f}, mape: {:.4f}, rmse: {:.4f}'.format(l_3, m_3, np.sqrt(r_3))
            self._logger.info(message)
            message = 'Horizon 30mins: mae: {:.4f}, mape: {:.4f}, rmse: {:.4f}'.format(l_6, m_6, np.sqrt(r_6))
            self._logger.info(message)
            message = 'Horizon 60mins: mae: {:.4f}, mape: {:.4f}, rmse: {:.4f}'.format(l_12, m_12, np.sqrt(r_12))
            self._logger.info(message)

        self._writer.add_scalar('{} loss'.format(dataset), mae, batches_seen)

        if label == 'without_regularization':
            return mae, mape, rmse
        else:
            return mae
```
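To see in miniature why the two aggregation schemes diverge, here is a self-contained toy example; the `masked_mae` helper and the numbers are purely illustrative, not taken from the GTS code:

```python
import torch

# Toy illustration: averaging per-batch masked MAEs differs from the masked MAE
# over the concatenated data when batches contain different numbers of valid
# (non-null) entries. All values below are made up for the example.
def masked_mae(y_pred, y_true, null_val=0.0):
    mask = y_true != null_val
    return (y_pred[mask] - y_true[mask]).abs().mean()

# Batch 1: four valid entries, each with absolute error 1.0
y_true_1 = torch.tensor([10.0, 10.0, 10.0, 10.0])
y_pred_1 = torch.tensor([11.0, 11.0, 11.0, 11.0])
# Batch 2: one valid entry (the rest padded with nulls), absolute error 5.0
y_true_2 = torch.tensor([10.0, 0.0, 0.0, 0.0])
y_pred_2 = torch.tensor([15.0, 0.0, 0.0, 0.0])

per_batch_avg = (masked_mae(y_pred_1, y_true_1) + masked_mae(y_pred_2, y_true_2)) / 2
full_batch = masked_mae(torch.cat([y_pred_1, y_pred_2]), torch.cat([y_true_1, y_true_2]))
print(per_batch_avg.item())  # 3.0 -> each batch weighted equally
print(full_batch.item())     # 1.8 -> each valid entry weighted equally
```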
Hi, thank you so much for raising this interesting and important question. Mathematically, the two should give the same result; however, as you said, the nulls may affect the results. I will recheck this part. Thanks also for providing the implementation and results.
Hi, I think that is because you should weight the contribution of each batch to the total by considering the percentage of valid data it contains, instead of simply averaging everything out.
Hi, thanks for your advice; I thought about this possibility too. But I noticed that all the datasets (train/val/test) are padded here, so the amount of data per batch is the same, which I assumed made simple averaging equivalent.
Yes, but the amount of valid data is not the same across batches (as the nulls are masked out), so padding does not change anything: each batch should be given a different weight.
If you check the computation of the masked metrics you'll see what I mean.
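To make the weighting concrete, here is a minimal sketch of batch-weighted aggregation, assuming a masked-MAE convention where entries equal to `null_val` are excluded; the function name and signature are illustrative, not part of the GTS code:

```python
import torch

def weighted_masked_mae(batches, null_val=0.0):
    """Aggregate MAE over batches, weighting each batch by its number of valid entries.

    `batches` is an iterable of (y_pred, y_true) tensor pairs; entries equal to
    `null_val` in y_true are treated as padded/missing and excluded. The result
    matches the masked MAE computed over the concatenated full set.
    """
    total_abs_err = 0.0
    total_valid = 0
    for y_pred, y_true in batches:
        mask = y_true != null_val
        total_abs_err += (y_pred[mask] - y_true[mask]).abs().sum().item()
        total_valid += int(mask.sum().item())
    return total_abs_err / total_valid
```

Averaging the per-batch masked means instead gives a batch with only a few valid points the same weight as a full batch, which is exactly the discrepancy between the minibatch and fullbatch numbers in the logs above.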
Wow, yes, I get it. Thank you for the reminder! I never noticed the difference in the metric calculation function between GTS and other baselines (such as Graph WaveNet).
It appears that the differences in how metrics are calculated have significantly affected the fairness of the comparison... That may be an important problem...