The combination of different `mode` and `split` leads to an incorrect calculation of the number of batches and epochs
**Describe the bug**
As the title says, the number of batches and epochs are currently calculated for each split as follows:
```python
...
# Process training data
if self.train_data is not None or self.train_loader is not None:
    # Calculate the number of update steps during training given the
    # local_update_steps
    num_train_batch, num_train_batch_last_epoch, num_train_epoch, \
        num_total_train_batch = self.pre_calculate_batch_epoch_num(
            self.cfg.train.local_update_steps)

    self.num_train_epoch = num_train_epoch
    self.num_train_batch = num_train_batch
    self.num_train_batch_last_epoch = num_train_batch_last_epoch
    self.num_total_train_batch = num_total_train_batch

# Process evaluation data
for mode in ["val", "test"]:
    setattr(self, "num_{}_epoch".format(mode), 1)
    if self.get("{}_data".format(mode)) is not None or self.get(
            "{}_loader".format(mode)) is not None:
        setattr(
            self, "num_{}_batch".format(mode),
            getattr(self, "num_{}_data".format(mode)) //
            self.cfg.data.batch_size +
            int(not self.cfg.data.drop_last and bool(
                getattr(self, "num_{}_data".format(mode)) %
                self.cfg.data.batch_size)))
...
```
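For reference, the val/test expression above is a floor division plus one extra batch when `drop_last` is false and the data size is not a multiple of the batch size. A minimal standalone restatement of that arithmetic (hypothetical numbers, not FederatedScope code):

```python
def num_batches(num_data: int, batch_size: int, drop_last: bool) -> int:
    # Floor division, plus one extra batch for the leftover samples
    # when drop_last is False and num_data % batch_size != 0.
    return num_data // batch_size + int(
        not drop_last and bool(num_data % batch_size))

# 105 samples with batch_size=10:
assert num_batches(105, 10, drop_last=True) == 10   # partial batch dropped
assert num_batches(105, 10, drop_last=False) == 11  # partial batch kept
```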
The finetune and training routines stop at:
```python
def _run_routine(self, ...):
    ...
    # Break in the final epoch
    if self.ctx.cur_mode == 'train' and epoch_i == \
            self.ctx.num_train_epoch - 1:
        if batch_i >= self.ctx.num_train_batch_last_epoch - 1:
            break
    ...
```
The problems are:
- If we choose the test or validation split for the training routine, `num_train_batch_last_epoch` and `num_train_epoch` are both wrong, since they are calculated for the training split (a concrete simulation follows this list).
- If we set different parameters (say, local update steps) for finetuning and training, they should have different values of `num_train_batch_last_epoch` and `num_train_epoch`.
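To make the first problem concrete: suppose the training routine is pointed at the test split. The counts used by the break condition still come from the train split, so the early stop can silently never fire. A small simulation with made-up sizes (illustrative only, not FederatedScope code):

```python
# Counts pre-computed for the *train* split (hypothetical values):
num_train_epoch = 2
num_train_batch_last_epoch = 32   # derived from the train split's size

# The split actually being iterated is the test split:
num_test_batch = 10

steps = 0
for epoch_i in range(num_train_epoch):
    for batch_i in range(num_test_batch):
        steps += 1
        # Same break condition as in _run_routine, but the threshold (32)
        # exceeds the test split's 10 batches, so it never triggers.
        if epoch_i == num_train_epoch - 1 and \
                batch_i >= num_train_batch_last_epoch - 1:
            break

print(steps)  # 20 -- every test batch in both epochs ran; no early stop
```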
**Expected behavior**
The number of batches and epochs should be calculated according to the combination of `mode` and `split`.
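One possible shape for this is to compute the counts from the split the routine will actually iterate over, and from the update-step budget of the mode that invokes it (train vs. finetune). The helper below is a hypothetical sketch under those assumptions, not the actual `pre_calculate_batch_epoch_num` API:

```python
def calc_batch_epoch_num(num_data: int, batch_size: int, drop_last: bool,
                         local_update_steps: int, batch_or_epoch: str):
    """Hypothetical per-(mode, split) version: num_data is the size of the
    split actually used; local_update_steps comes from the calling mode."""
    batches_per_epoch = num_data // batch_size + int(
        not drop_last and bool(num_data % batch_size))
    if batch_or_epoch == "epoch":
        # local_update_steps counts full passes over the chosen split.
        num_epoch = local_update_steps
        num_batch_last_epoch = batches_per_epoch
        num_total_batch = local_update_steps * batches_per_epoch
    else:
        # local_update_steps counts individual batches.
        num_epoch = -(-local_update_steps // batches_per_epoch)  # ceil div
        num_batch_last_epoch = (local_update_steps % batches_per_epoch
                                or batches_per_epoch)
        num_total_batch = local_update_steps
    return batches_per_epoch, num_batch_last_epoch, num_epoch, num_total_batch
```

With counts keyed this way, training on the test split would use the test split's size, and finetuning would pass its own step budget (e.g. a `cfg.finetune.local_update_steps`) instead of reusing the training one.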