FederatedScope

The combination of different modes and splits leads to incorrect calculation of the number of batches and epochs

Open · DavdGao opened this issue on Jul 27, 2022 · 0 comments

Describe the bug

As the title says, the numbers of batches and epochs are currently calculated for each split as follows:

        ...
        # Process training data
        if self.train_data is not None or self.train_loader is not None:
            # Calculate the number of update steps during training given the
            # local_update_steps
            num_train_batch, num_train_batch_last_epoch, num_train_epoch, \
                num_total_train_batch = self.pre_calculate_batch_epoch_num(
                    self.cfg.train.local_update_steps)

            self.num_train_epoch = num_train_epoch
            self.num_train_batch = num_train_batch
            self.num_train_batch_last_epoch = num_train_batch_last_epoch
            self.num_total_train_batch = num_total_train_batch

        # Process evaluation data
        for mode in ["val", "test"]:
            setattr(self, "num_{}_epoch".format(mode), 1)
            if self.get("{}_data".format(mode)) is not None or self.get(
                    "{}_loader".format(mode)) is not None:
                setattr(
                    self, "num_{}_batch".format(mode),
                    getattr(self, "num_{}_data".format(mode)) //
                    self.cfg.data.batch_size +
                    int(not self.cfg.data.drop_last and bool(
                        getattr(self, "num_{}_data".format(mode)) %
                        self.cfg.data.batch_size)))
            ...
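
For reference, the evaluation branch above boils down to a floor division plus one extra batch for a non-empty remainder, unless drop_last discards it. A minimal standalone sketch (the helper name and variables are mine, not FederatedScope's):

    def num_batches(num_data, batch_size, drop_last):
        # One batch per full batch_size chunk, plus one for the remainder
        # unless drop_last discards the partial batch.
        return num_data // batch_size + int(
            not drop_last and bool(num_data % batch_size))

    assert num_batches(105, 10, drop_last=False) == 11
    assert num_batches(105, 10, drop_last=True) == 10
    assert num_batches(100, 10, drop_last=False) == 10  # no remainder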

and the finetune and training routines stop at

    def _run_routine(self, ...):
            ...
            # Break in the final epoch
            if self.ctx.cur_mode == 'train' and epoch_i == \
                    self.ctx.num_train_epoch - 1:
                if batch_i >= self.ctx.num_train_batch_last_epoch - 1:
                    break
            ...

The problems are:

  • If we choose the test or validation split for the training routine, num_train_batch_last_epoch and num_train_epoch are both wrong, since they are calculated from the training split (see the numeric sketch after this list).
  • If we set different parameters (say, local_update_steps) for finetuning and training, the two routines should have different values of num_train_batch_last_epoch and num_train_epoch.
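
To make the first problem concrete, consider a hypothetical run where the training routine is executed on the test split (all numbers below are illustrative, not taken from the repository):

    # Illustrative sizes only.
    batch_size, drop_last = 10, True
    num_train_data, num_test_data = 1000, 100

    num_train_batch = num_train_data // batch_size      # 100, train-derived
    actual_test_batches = num_test_data // batch_size   # 10, what the loader yields

    # The break condition still compares against the train-derived count,
    # so it can never fire before the test loader is exhausted.
    for batch_i in range(actual_test_batches):
        if batch_i >= num_train_batch - 1:  # 9 >= 99 is never True
            break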

Expected behavior

The numbers of batches and epochs should be calculated according to the combination of mode and split.
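
One possible direction, sketched under the assumption that each routine knows its own mode, split, and step budget (the helper below is hypothetical, not the project's actual fix): compute the counts once per (mode, split) pair instead of reusing the train-derived values everywhere.

    import math

    def calc_batch_epoch_num(local_update_steps, batch_or_epoch, num_data,
                             batch_size, drop_last):
        """Hypothetical helper: derive the batch/epoch counts for a single
        (mode, split) pair from that split's size and the routine's own
        local_update_steps, instead of always using the training split."""
        num_batch_per_epoch = num_data // batch_size + int(
            not drop_last and bool(num_data % batch_size))
        if batch_or_epoch == 'epoch':
            # local_update_steps counts epochs.
            num_epoch = local_update_steps
            num_batch_last_epoch = num_batch_per_epoch
            num_total_batch = local_update_steps * num_batch_per_epoch
        else:
            # local_update_steps counts batches.
            num_epoch = math.ceil(local_update_steps / num_batch_per_epoch)
            num_batch_last_epoch = (local_update_steps % num_batch_per_epoch
                                    or num_batch_per_epoch)
            num_total_batch = local_update_steps
        return (num_batch_per_epoch, num_batch_last_epoch, num_epoch,
                num_total_batch)

    # E.g. finetuning on the test split with its own step budget:
    counts = calc_batch_epoch_num(local_update_steps=5, batch_or_epoch='batch',
                                  num_data=100, batch_size=10, drop_last=True)
    assert counts == (10, 5, 1, 5)

Each routine (training vs. finetuning) would then look up its own tuple, so differing local_update_steps naturally yield different num_train_batch_last_epoch and num_train_epoch.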

DavdGao · Jul 27 '22 09:07