
Adding language-specific validation sets to deepspeed

Open hadyelsahar opened this issue 4 years ago • 4 comments

The idea of this issue is to modify the Megatron-DeepSpeed repository code that we use for training all models, so that we can track validation loss on several validation sets separately. This would allow us to monitor training progress independently for each language.

Currently, the validation loss is calculated on a single validation set that includes the same language combination as the training data. (see here 13B param model training on tensorboard)


Useful pointers

  • How datasets are loaded in model pre-training here
  • Dataset loader for GPT here
  • Validation step execution here

Progress

  • Forked deepspeed where all development happens (ask @hadyelsahar for invitation) here
  • Pull request: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/97

hadyelsahar avatar Sep 08 '21 10:09 hadyelsahar

I can review/implement this part.

sbmaruf avatar Sep 08 '21 12:09 sbmaruf

My current understanding is that in training.py, the train, validation, and test datasets are loaded by the function build_train_valid_test_data_iterators.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L123-L136

Evaluation is then done here, both for valid_data_iterator and test_data_iterator.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L152-L166

We could make valid_data_iterator a list of per-language data iterators and call evaluate_and_print_results iteratively for each language.

for each_language_data_loader in valid_data_iterator:
    evaluate_and_print_results(
        prefix,
        forward_step_func,
        each_language_data_loader,
        model,
        eval_metric,
    )

Some modification to evaluate_and_print_results will be required so that we save each validation metric for each language.
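A minimal sketch of what that per-language logging could look like, assuming valid_data_iterator has become a list of per-language iterators; the function name, the lang_names argument, and the tensorboard tag format are all assumptions here, not the actual PR code:

```python
def evaluate_all_languages(evaluate_fn, valid_data_iterators, lang_names,
                           writer, iteration):
    """Evaluate each per-language iterator separately and log every loss
    under a language-specific tag, so tensorboard shows one curve per
    language instead of a single mixed validation loss.

    evaluate_fn stands in for Megatron's evaluation helper, which returns
    a dict of averaged losses for one data iterator.
    """
    results = {}
    for lang, data_iterator in zip(lang_names, valid_data_iterators):
        loss_dict = evaluate_fn(data_iterator)
        for key, value in loss_dict.items():
            tag = f'{key} validation/{lang}'
            results[tag] = value
            if writer is not None:
                # e.g. torch.utils.tensorboard.SummaryWriter.add_scalar
                writer.add_scalar(tag, value, iteration)
    return results
```

Keeping the language name in the tag suffix means each language gets its own curve in the same tensorboard plot group.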

lintangsutawika avatar Sep 08 '21 13:09 lintangsutawika

Currently the code base yields a single validation set and a single test set. There’s no support for specifying multiple validation datasets via args.

My ad-hoc solution is to add an extra argument:

  --extra-valid-data-path [EXTRA_VALID_DATA_PATH ...]
Path to extra validation dataset(s) to be monitored during training. Accepted formats:
1) a single data path;
2) multiple datasets blended into a single validation set, in the form: data1-weight data1-path data2-weight data2-path;
3) multiple validation sets, each in form (2), separated by commas: data1-weight data1-path data2-weight data2-path, data3-weight data3-path data4-weight data4-path ...

The idea here is to allow mixing different validation sets on the fly

python pretrain_gpt2.py … --extra-valid-data-path 0.5 en_data, 0.5 fr_data, 0.33 rare1_data 0.33 rare2_data 0.33 rare3_data
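A sketch of how the comma-separated groups in that invocation could be split into per-set (weight, path) pairs; the helper name is hypothetical, and this assumes the shell passes the tokens through with trailing commas attached (as argparse with nargs='*' would receive them):

```python
def parse_extra_valid_data_paths(tokens):
    """Split a flat token list into validation-set groups at commas,
    then pair alternating weight/path tokens within each group."""
    groups, current = [], []
    for token in tokens:
        ends_group = token.endswith(',')
        current.append(token.rstrip(','))
        if ends_group:
            groups.append(current)
            current = []
    if current:  # last group has no trailing comma
        groups.append(current)
    # pair up (weight, path) within each group
    return [
        [(float(g[i]), g[i + 1]) for i in range(0, len(g), 2)]
        for g in groups
    ]
```

For the example above this would yield three validation sets: one for en_data, one for fr_data, and one blending the three rare-language datasets.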

any thoughts about a better design?

hadyelsahar avatar Sep 13 '21 14:09 hadyelsahar

Work-in-progress PR sent here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/97

hadyelsahar avatar Sep 14 '21 01:09 hadyelsahar