GTS
Questions about the metrics.
Hello!
I noticed that the code seems to calculate the metrics (MAE, MAPE, RMSE) for each mini-batch and then average them during the test phase. However, because the null values are distributed unevenly across the test data, this approach gives different results from the canonical evaluation used in previous work (e.g. DCRNN and Graph WaveNet), even though the dataset is padded.
I ran both evaluation methods in the same test pass and got the following results:
```
2022-01-21 20:23:27,697 - INFO - Epoch [44/200] (16875) train_mae: 2.5245, val_mae: 3.5958
--------------------------(minibatch)
2022-01-21 20:23:41,579 - INFO - Horizon 15mins: mae: 2.6843, mape: 0.0678, rmse: 5.2968
2022-01-21 20:23:41,579 - INFO - Horizon 30mins: mae: 3.1403, mape: 0.0830, rmse: 6.4677
2022-01-21 20:23:41,579 - INFO - Horizon 60mins: mae: 3.6888, mape: 0.1019, rmse: 7.7696
--------------------------(fullbatch)
2022-01-21 20:23:55,685 - INFO - Horizon 15mins: mae: 2.7858, mape: 0.0704, rmse: 5.3782
2022-01-21 20:23:55,686 - INFO - Horizon 30mins: mae: 3.2629, mape: 0.0861, rmse: 6.5821
2022-01-21 20:23:55,686 - INFO - Horizon 60mins: mae: 3.8445, mape: 0.1061, rmse: 7.9454
```
The full-batch evaluation method I used is as follows:
```python
def evaluate_new(self, label, dataset='test', batches_seen=0, gumbel_soft=True):
    """
    Computes the masked metrics over the whole test set at once
    (instead of averaging per-minibatch values).
    :return: mae, mape, rmse (or mae only, in the regularized case)
    """
    with torch.no_grad():
        self.GTS_model = self.GTS_model.eval()
        val_iterator = self._data['{}_loader'.format(dataset)].get_iterator()
        losses = []
        temp = self.temperature
        y_pred_list = []
        y_true_list = []

        for batch_idx, (x, y) in enumerate(val_iterator):
            x, y = self._prepare_data(x, y)
            output, mid_output = self.GTS_model(label, x, self._train_feas, temp, gumbel_soft)

            if label == 'without_regularization':
                loss = self._compute_loss(y, output)
                y_true = self.standard_scaler.inverse_transform(y)
                y_pred = self.standard_scaler.inverse_transform(output)
                y_pred_list.append(y_pred)
                y_true_list.append(y_true)
                losses.append(loss.item())
            else:
                loss_1 = self._compute_loss(y, output)
                pred = torch.sigmoid(mid_output.view(mid_output.shape[0] * mid_output.shape[1]))
                true_label = self.adj_mx.view(mid_output.shape[0] * mid_output.shape[1]).to(device)
                compute_loss = torch.nn.BCELoss()
                loss_g = compute_loss(pred, true_label)
                loss = loss_1 + loss_g
                # option
                # loss = loss_1 + 10*loss_g
                losses.append(loss_1.item() + loss_g.item())
                y_true = self.standard_scaler.inverse_transform(y)
                y_pred = self.standard_scaler.inverse_transform(output)
                y_pred_list.append(y_pred)
                y_true_list.append(y_true)

        # Concatenate all batches along the batch dimension
        # (tensors are shaped [horizon, batch_size, num_nodes * output_dim]).
        y_pred_full = torch.cat(y_pred_list, dim=1)
        y_true_full = torch.cat(y_true_list, dim=1)
        mae = masked_mae_loss(y_pred_full, y_true_full).item()
        mape = masked_mape_loss(y_pred_full, y_true_full).item()
        rmse = torch.sqrt(masked_mse_loss(y_pred_full, y_true_full)).item()

        # Followed the DCRNN TensorFlow implementation:
        # horizon indices 2, 5, 11 correspond to 15/30/60 minutes at 5-minute steps.
        l_3 = masked_mae_loss(y_pred_full[2:3], y_true_full[2:3]).item()
        m_3 = masked_mape_loss(y_pred_full[2:3], y_true_full[2:3]).item()
        r_3 = masked_mse_loss(y_pred_full[2:3], y_true_full[2:3]).item()
        l_6 = masked_mae_loss(y_pred_full[5:6], y_true_full[5:6]).item()
        m_6 = masked_mape_loss(y_pred_full[5:6], y_true_full[5:6]).item()
        r_6 = masked_mse_loss(y_pred_full[5:6], y_true_full[5:6]).item()
        l_12 = masked_mae_loss(y_pred_full[11:12], y_true_full[11:12]).item()
        m_12 = masked_mape_loss(y_pred_full[11:12], y_true_full[11:12]).item()
        r_12 = masked_mse_loss(y_pred_full[11:12], y_true_full[11:12]).item()

        if dataset == 'test':
            message = 'Horizon 15mins: mae: {:.4f}, mape: {:.4f}, rmse: {:.4f}'.format(l_3, m_3, np.sqrt(r_3))
            self._logger.info(message)
            message = 'Horizon 30mins: mae: {:.4f}, mape: {:.4f}, rmse: {:.4f}'.format(l_6, m_6, np.sqrt(r_6))
            self._logger.info(message)
            message = 'Horizon 60mins: mae: {:.4f}, mape: {:.4f}, rmse: {:.4f}'.format(l_12, m_12, np.sqrt(r_12))
            self._logger.info(message)

        self._writer.add_scalar('{} loss'.format(dataset), mae, batches_seen)

        if label == 'without_regularization':
            return mae, mape, rmse
        else:
            return mae
```
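To see in miniature why the two aggregation schemes diverge, here is a self-contained toy example; the `masked_mae` helper and the numbers are purely illustrative, not taken from the GTS code:

```python
import torch

# Toy illustration: averaging per-batch masked MAEs differs from the masked MAE
# over the concatenated data when batches contain different numbers of valid
# (non-null) entries. All values below are made up for the example.
def masked_mae(y_pred, y_true, null_val=0.0):
    mask = y_true != null_val
    return (y_pred[mask] - y_true[mask]).abs().mean()

# Batch 1: four valid entries, each with absolute error 1.0
y_true_1 = torch.tensor([10.0, 10.0, 10.0, 10.0])
y_pred_1 = torch.tensor([11.0, 11.0, 11.0, 11.0])
# Batch 2: one valid entry (the rest padded with nulls), absolute error 5.0
y_true_2 = torch.tensor([10.0, 0.0, 0.0, 0.0])
y_pred_2 = torch.tensor([15.0, 0.0, 0.0, 0.0])

per_batch_avg = (masked_mae(y_pred_1, y_true_1) + masked_mae(y_pred_2, y_true_2)) / 2
full_batch = masked_mae(torch.cat([y_pred_1, y_pred_2]), torch.cat([y_true_1, y_true_2]))
print(per_batch_avg.item())  # 3.0 -> each batch weighted equally
print(full_batch.item())     # 1.8 -> each valid entry weighted equally
```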
Hi, thank you so much for raising this interesting and important question. Mathematically, the two should give the same result; however, as you said, the nulls may affect the results. I will recheck this part. Thanks also for providing the implementation and results.
Hi, I think that is because you should weight the contribution of each batch to the total by considering the percentage of valid data it contains, instead of simply averaging everything out.
Hi, thanks for your advice; I thought about this possibility too. But I noticed that all the datasets (train/val/test) are padded here, so the amount of data per batch is the same, which I assumed made simple averaging equivalent.
Yes, but the amount of valid data is not the same across batches (as the nulls are masked out), so padding does not change anything: each batch should be given a different weight.
If you check the computation of the masked metrics you'll see what I mean.
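To make the weighting concrete, here is a minimal sketch of batch-weighted aggregation, assuming a masked-MAE convention where entries equal to `null_val` are excluded; the function name and signature are illustrative, not part of the GTS code:

```python
import torch

def weighted_masked_mae(batches, null_val=0.0):
    """Aggregate MAE over batches, weighting each batch by its number of valid entries.

    `batches` is an iterable of (y_pred, y_true) tensor pairs; entries equal to
    `null_val` in y_true are treated as padded/missing and excluded. The result
    matches the masked MAE computed over the concatenated full set.
    """
    total_abs_err = 0.0
    total_valid = 0
    for y_pred, y_true in batches:
        mask = y_true != null_val
        total_abs_err += (y_pred[mask] - y_true[mask]).abs().sum().item()
        total_valid += int(mask.sum().item())
    return total_abs_err / total_valid
```

Averaging the per-batch masked means instead gives a batch with only a few valid points the same weight as a full batch, which is exactly the discrepancy between the minibatch and fullbatch numbers in the logs above.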
Wow, yes, I get it. Thank you for the reminder! I never noticed the difference in the metric calculation function between GTS and other baselines (such as Graph WaveNet).
It appears that the differences in how metrics are calculated have significantly affected the fairness of the comparison... That may be an important problem...