FederatedScope

The combination of different modes and splits leads to incorrect calculation of the number of batches and epochs

Open · DavdGao opened this issue on Jul 27, 2022 · 0 comments

Describe the bug

As the title says, the numbers of batches and epochs are currently calculated for each split as follows:

        ...
        # Process training data
        if self.train_data is not None or self.train_loader is not None:
            # Calculate the number of update steps during training given the
            # local_update_steps
            num_train_batch, num_train_batch_last_epoch, num_train_epoch, \
                num_total_train_batch = self.pre_calculate_batch_epoch_num(
                    self.cfg.train.local_update_steps)

            self.num_train_epoch = num_train_epoch
            self.num_train_batch = num_train_batch
            self.num_train_batch_last_epoch = num_train_batch_last_epoch
            self.num_total_train_batch = num_total_train_batch

        # Process evaluation data
        for mode in ["val", "test"]:
            setattr(self, "num_{}_epoch".format(mode), 1)
            if self.get("{}_data".format(mode)) is not None or self.get(
                    "{}_loader".format(mode)) is not None:
                setattr(
                    self, "num_{}_batch".format(mode),
                    getattr(self, "num_{}_data".format(mode)) //
                    self.cfg.data.batch_size +
                    int(not self.cfg.data.drop_last and bool(
                        getattr(self, "num_{}_data".format(mode)) %
                        self.cfg.data.batch_size)))
            ...
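
For reference, the evaluation branch above boils down to a floor division plus one extra batch for a non-empty remainder, unless drop_last discards it. A minimal standalone sketch (the helper name and variables are mine, not FederatedScope's):

    def num_batches(num_data, batch_size, drop_last):
        # One batch per full batch_size chunk, plus one for the remainder
        # unless drop_last discards the partial batch.
        return num_data // batch_size + int(
            not drop_last and bool(num_data % batch_size))

    assert num_batches(105, 10, drop_last=False) == 11
    assert num_batches(105, 10, drop_last=True) == 10
    assert num_batches(100, 10, drop_last=False) == 10  # no remainder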

and the finetune and training routines stop at

    def _run_routine(self, ...):
            ...
            # Break in the final epoch
            if self.ctx.cur_mode == 'train' and epoch_i == \
                    self.ctx.num_train_epoch - 1:
                if batch_i >= self.ctx.num_train_batch_last_epoch - 1:
                    break
            ...

The problems are:

  • If we choose the test or validation split for the training routine, num_train_batch_last_epoch and num_train_epoch are both wrong, since they are calculated from the training split (see the numeric sketch after this list).
  • If we set different parameters (say, local_update_steps) for finetuning and training, the two routines should have different values of num_train_batch_last_epoch and num_train_epoch.
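
To make the first problem concrete, consider a hypothetical run where the training routine is executed on the test split (all numbers below are illustrative, not taken from the repository):

    # Illustrative sizes only.
    batch_size, drop_last = 10, True
    num_train_data, num_test_data = 1000, 100

    num_train_batch = num_train_data // batch_size      # 100, train-derived
    actual_test_batches = num_test_data // batch_size   # 10, what the loader yields

    # The break condition still compares against the train-derived count,
    # so it can never fire before the test loader is exhausted.
    for batch_i in range(actual_test_batches):
        if batch_i >= num_train_batch - 1:  # 9 >= 99 is never True
            break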

Expected behavior

The numbers of batches and epochs should be calculated according to the combination of mode and split.
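
One possible direction, sketched under the assumption that each routine knows its own mode, split, and step budget (the helper below is hypothetical, not the project's actual fix): compute the counts once per (mode, split) pair instead of reusing the train-derived values everywhere.

    import math

    def calc_batch_epoch_num(local_update_steps, batch_or_epoch, num_data,
                             batch_size, drop_last):
        """Hypothetical helper: derive the batch/epoch counts for a single
        (mode, split) pair from that split's size and the routine's own
        local_update_steps, instead of always using the training split."""
        num_batch_per_epoch = num_data // batch_size + int(
            not drop_last and bool(num_data % batch_size))
        if batch_or_epoch == 'epoch':
            # local_update_steps counts epochs.
            num_epoch = local_update_steps
            num_batch_last_epoch = num_batch_per_epoch
            num_total_batch = local_update_steps * num_batch_per_epoch
        else:
            # local_update_steps counts batches.
            num_epoch = math.ceil(local_update_steps / num_batch_per_epoch)
            num_batch_last_epoch = (local_update_steps % num_batch_per_epoch
                                    or num_batch_per_epoch)
            num_total_batch = local_update_steps
        return (num_batch_per_epoch, num_batch_last_epoch, num_epoch,
                num_total_batch)

    # E.g. finetuning on the test split with its own step budget:
    counts = calc_batch_epoch_num(local_update_steps=5, batch_or_epoch='batch',
                                  num_data=100, batch_size=10, drop_last=True)
    assert counts == (10, 5, 1, 5)

Each routine (training vs. finetuning) would then look up its own tuple, so differing local_update_steps naturally yield different num_train_batch_last_epoch and num_train_epoch.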

DavdGao · Jul 27 '22 09:07