VISTA Training
Can you provide some details about the training of stage 2? When I trained with the same setup (e.g., learning rate and hard negative samples), I found that the IT2I loss was much lower than the T2IT loss. Is there a rough reference value? Did you perform any additional operations to balance the two tasks, or did you simply alternate between tasks for each batch?
Thanks for your interest!
We did not intentionally design a strategy to balance the losses of the two tasks. As stated in the paper, the two tasks are trained in alternating batches.
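For clarity, here is a minimal sketch of what per-batch task alternation can look like. The names (`model`, `optimizer`, `it2i_loader`, `t2it_loader`) are hypothetical placeholders, not VISTA's actual training code:

```python
from itertools import cycle

def train_alternating(model, optimizer, it2i_loader, t2it_loader, num_steps):
    """Alternate between the two tasks batch by batch, with no loss balancing."""
    it2i_iter = cycle(it2i_loader)  # image+text -> image batches
    t2it_iter = cycle(t2it_loader)  # text -> image+text batches
    for step in range(num_steps):
        # Even steps take an IT2I batch, odd steps a T2IT batch.
        batch = next(it2i_iter) if step % 2 == 0 else next(t2it_iter)
        loss = model(**batch)  # model returns the task-specific contrastive loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```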
Thanks! When I was training, the IT2I loss converged quickly, but the T2IT loss remained relatively large (3.2 vs. 0.8). Did you also observe a similar phenomenon during your training?
In our training, the losses for the two tasks converged to 2.5 (T2IT) and 0.8 (IT2I), respectively. This phenomenon is normal, since the two tasks and their data distributions are not identical.
Moreover, since the data are synthesized, stage 2 can be viewed as a self-supervised pre-training process, and a higher loss does not necessarily mean that training has failed to converge.