DST-CBC
Question about the paper
Hi! I just read your paper on DMT and quite appreciate your work. But I can't fully understand this statement in the paper: "It can be interpreted that a relatively larger γ1 represents a more emphasized entropy minimization, a larger γ2 represents a more emphasized mutual learning. Large γ values are often better for high-noise scenarios, or to maintain larger inter-model disagreement." Could you please explain it? Thanks a lot!
@Hugo-cell111 FYI, a larger γ corresponds to larger differences in loss weighting. Since loss weighting is the core of the dynamic loss, hence the expression "emphasize".
- γ1 is used when models predict the same label, which corresponds to entropy minimization.
- γ2 is used when models predict different labels, which corresponds to mutual learning.
As for the last statement on high noise and disagreement, it is more empirical. You can understand it as the effect of an overall lower learning rate (although not exactly so, considering the exponential dynamic weight): the models won't take large steps towards noisy labels or towards each other.
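To make the role of γ1 and γ2 more concrete, here is a rough PyTorch sketch of a confidence-based dynamic weight with two exponents. This is not the exact code in this repo; the function name, signature, and the choice of which model's probabilities enter each branch are illustrative assumptions, so please check them against the paper and the released code.

```python
import torch

def dynamic_weight(logits_current, pseudo_labels, gamma1=5.0, gamma2=5.0):
    """Illustrative sketch (not the repo's implementation) of a per-sample
    dynamic loss weight controlled by two exponents.

    logits_current: predictions of the model being trained, shape (N, C, ...)
    pseudo_labels:  hard labels produced by the other model, shape (N, ...)
    gamma1: exponent used where the two models agree (entropy minimization)
    gamma2: exponent used where they disagree (mutual learning)
    """
    probs = torch.softmax(logits_current, dim=1)

    # Current model's own most-confident prediction.
    confidence, own_labels = probs.max(dim=1)
    agree = own_labels == pseudo_labels

    # Probability the current model assigns to the other model's label.
    p_on_pseudo = probs.gather(1, pseudo_labels.unsqueeze(1)).squeeze(1)

    # Larger gamma -> weights decay faster as confidence drops,
    # i.e. the corresponding term is "emphasized" more sharply.
    weight = torch.where(agree, confidence ** gamma1, p_on_pseudo ** gamma2)
    return weight
```

With a large γ, low-confidence positions receive a weight close to zero, which is one way to see why large γ values behave somewhat like a lower effective learning rate on noisy pseudo-labels.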