DeepCore
Further explanation of the degraded performance of baselines compared to random sampling.
First of all, thank you so much for providing such a good GitHub repository.
I ran the code myself; it is very reproducible and has been helpful for building my own work on top of it.
One questionable point is as follows: some papers claim emphatically that their method clearly outperforms a randomly sampled coreset; "Deep Learning on a Data Diet: Finding Important Examples Early in Training" is one representative example. However, according to the results reported here, it seems to record lower performance than random sampling for coresets below 10%.
This seems to be a very strong argument. Is this phenomenon caused by insufficient hyper-parameter tuning in this repository, or are the claims of these studies partially wrong?
Hi, thanks for the question. We are cautious about this conclusion. We only conclude that random selection is still a strong baseline, which may work better than the proposed methods in many settings. The exact results in practice can differ due to different implementations, settings (e.g. coreset size, architecture), and hyperparameters (e.g. learning rate, epochs). It is hard to give one conclusion for all experiments.
Thanks for your response. Then, are the provided baselines all from the authors' code or not? I think the strength of the claim depends on the reproducibility of each algorithm. I had assumed that each baseline came from the authors' code, which would imply sufficient credibility.
All baselines are re-implemented by ourselves based on the papers and the authors' code, because we need to compare multiple algorithms from different papers under the same experimental setting. There may be hidden mistakes in the code; if you find any, please let us know. Thanks!
While checking the code, I found that some loss functions are calculated incorrectly. For instance, in the Grad-Match method code:
loss = self.criterion(torch.nn.functional.softmax(outputs, dim=1), targets.to(self.args.device)).sum()
Here self.criterion is torch.nn.CrossEntropyLoss, which computes the loss from (logits, targets). However, the loss is instead computed from (softmax output, targets).
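For reference, a minimal sketch of the corrected call (assuming self.criterion stays a CrossEntropyLoss instance with a non-default reduction such as reduction='none', so that .sum() remains meaningful) could be:

# CrossEntropyLoss applies log-softmax internally, so pass the raw logits:
loss = self.criterion(outputs, targets.to(self.args.device)).sum()

Applying softmax before CrossEntropyLoss effectively applies softmax twice, which flattens the loss and distorts the per-sample gradients used for selection.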
Thanks for pointing it out. We will correct it.
A quick comment:
From "Deep Learning on a Data Diet: Finding Important Examples Early in Training" Page 6
Interestingly, at extreme levels of pruning with either EL2N or GraNd scores, we observe a sharp drop in performance. We hypothesize that this is because at high levels of pruning, using either GraNd or EL2N scores leads to bad coverage of the data distribution. By only focusing on the highest error examples, it is likely that an entire subpopulation of significant size that is present in the test data is now excluded from the training set. We only fit a small number of very difficult examples and do not keep enough of a variety of examples for training models with good test error.
If you look at Figure 1 in that paper, they only report results with pruning fractions below roughly 0.5 to 0.7. They do somewhat hide the weakness of their method, but I see no contradiction between their results and your observation.
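For other readers, here is a hedged sketch of EL2N-style score-based pruning (the names el2n_scores and prune_by_score are illustrative, not the repository's API); it shows why keeping only the highest-error examples at an extreme pruning fraction can leave easy subpopulations uncovered:

import torch
import torch.nn.functional as F

def el2n_scores(model, loader, device, num_classes):
    # EL2N score per example: L2 norm of (softmax output - one-hot label),
    # in the paper averaged over several early-training checkpoints/seeds.
    model.eval()
    scores = []
    with torch.no_grad():
        for inputs, targets in loader:
            probs = F.softmax(model(inputs.to(device)), dim=1)
            onehot = F.one_hot(targets.to(device), num_classes).float()
            scores.append((probs - onehot).norm(dim=1, p=2))
    return torch.cat(scores)

def prune_by_score(scores, keep_frac):
    # Keep only the hardest examples (highest scores). At an extreme keep
    # fraction (e.g. keep_frac = 0.1), the kept set concentrates on a few
    # very hard examples and may cover easier subpopulations poorly.
    k = max(1, int(keep_frac * scores.numel()))
    return torch.topk(scores, k).indices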
Thanks for the comment!
I feel the implementation is inconsistent with my understanding. Line 49 of grand.py:
self.norm_matrix[i * self.args.selection_batch:
                 min((i + 1) * self.args.selection_batch, sample_num), self.cur_repeat] = torch.norm(
    torch.cat([bias_parameters_grads,
               (self.model.embedding_recorder.embedding.view(batch_num, 1, embedding_dim)
                    .repeat(1, self.args.num_classes, 1)
                * bias_parameters_grads.view(batch_num, self.args.num_classes, 1)
                    .repeat(1, 1, embedding_dim)
                ).view(batch_num, -1)],
              dim=1),
    dim=1, p=2)
This is not calculating the gradient norm. Correct me if I am wrong.
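For context, here is my hedged reading of what that line computes, rewritten with the shapes implied by the snippet (bias_parameters_grads of shape (batch_num, num_classes), embedding of shape (batch_num, embedding_dim)); this is a re-expression for discussion, not an official restatement:

# Assumed: bias_parameters_grads holds the per-sample gradient of the loss
# w.r.t. the last-layer bias, i.e. dL/d(logits) for a linear classification head.
emb = self.model.embedding_recorder.embedding              # (B, D) penultimate features
bias_grad = bias_parameters_grads                          # (B, C)
# Per-sample outer product = gradient w.r.t. the last linear layer's weight matrix
weight_grad = bias_grad.unsqueeze(2) * emb.unsqueeze(1)    # (B, C, D)
# L2 norm of the concatenated last-layer gradient (bias part + weight part)
per_sample_norm = torch.cat([bias_grad, weight_grad.flatten(1)], dim=1).norm(dim=1, p=2)

So the expression appears to take the norm of the gradient restricted to the final linear layer's parameters; whether that restriction is an acceptable stand-in for the full gradient norm in the GraNd score is, I think, the crux of the question.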