Adding language-specific validation sets to DeepSpeed
The idea of this issue is to modify the Megatron-DeepSpeed repository code that we use for training all models, so that we can track validation loss on several validation sets separately. This would allow us to monitor training progress independently for each language.
Currently, the validation loss is calculated on a single validation set that contains the same language mix as the training data (see the 13B-parameter model training on TensorBoard here).

Useful pointers
- How datasets are loaded in model pre-training here
- Dataset loader for GPT here
- Validation step execution here
Progress
- Fork of Megatron-DeepSpeed where all development happens (ask @hadyelsahar for an invitation) here
- Pull request: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/97
I can review/implement this part.
My current understanding is that in training.py, the train, validation, and test datasets are loaded by the function build_train_valid_test_data_iterators:
https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L123-L136
Evaluation is then done here, for both valid_data_iterator and test_data_iterator:
https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L152-L166
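One way to obtain a per-language valid_data_iterator would be to call the builder once per language and collect the resulting validation iterators into a list. A minimal sketch, assuming a hypothetical per_language_providers list of dataset-provider callables (one per language); only the builder and its three return values are taken from training.py:

```python
# Hypothetical sketch: build one validation iterator per language, so that
# `valid_data_iterator` becomes a list rather than a single iterator.
# `per_language_providers` is an assumed list of dataset-provider callables,
# one per language; it does not exist in the current code base.
valid_data_iterator = []
for provider in per_language_providers:
    # Keep only the validation iterator; the train/test iterators still come
    # from the existing single call in training.py.
    _, lang_valid_iterator, _ = build_train_valid_test_data_iterators(provider)
    valid_data_iterator.append(lang_valid_iterator)
```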
We could then iterate over this list of per-language data loaders and call evaluate_and_print_results once for each language:
```python
for each_language_data_loader in valid_data_iterator:
    evaluate_and_print_results(
        prefix, forward_step_func,
        each_language_data_loader,
        model,
        eval_metric,
    )
```
Some modification to evaluate_and_print_results will be required so that each validation metric is saved separately per language.
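One lightweight option, sketched below, is to encode the language in the prefix argument so each language's loss is printed under its own label. The language_tags list is an assumed structure aligned with valid_data_iterator, and whether the prefix also reaches the TensorBoard key may itself require a small change inside evaluate_and_print_results:

```python
# Hypothetical sketch: distinguish per-language metrics via the `prefix`
# argument. `language_tags` is an assumed list such as ['en', 'fr', ...],
# aligned with `valid_data_iterator`; neither exists in the current code base.
for lang, each_language_data_loader in zip(language_tags, valid_data_iterator):
    prefix = 'validation set ({})'.format(lang)
    evaluate_and_print_results(
        prefix, forward_step_func,
        each_language_data_loader,
        model,
        eval_metric,
    )
```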
Currently the code base yields a single validation set and a single test set; there is no support for arguments specifying multiple validation datasets.
My ad-hoc solution is to add an extra argument:
--extra-valid-data-path [EXTRA_VALID_DATA_PATH ...]
Paths to extra validation datasets to be monitored during training. Accepted formats:
1) a single data path;
2) multiple datasets blended into a single validation set, in the form: data1-weight data1-path data2-weight data2-path;
3) multiple validation sets, each given as in (2) and separated by commas, in the form: data1-weight data1-path data2-weight data2-path, data3-weight data3-path data4-weight data4-path, ...
The idea here is to allow mixing different validation sets on the fly:

```
python pretrain_gpt2.py … --extra-valid-data-path 0.5 en_data, 0.5 fr_data, 0.33 rare1_data 0.33 rare2_data 0.33 rare3_data
```
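A possible shape for the flag and its parser is sketched below, assuming Python's argparse and the comma-as-group-separator convention above; the helper parse_extra_valid_data_path and its exact return structure are illustrative, not the PR's final design:

```python
import argparse

def parse_extra_valid_data_path(tokens):
    """Split the flat token list into one group of (weight, path) pairs per
    comma-separated validation set. Hypothetical helper, not from the PR."""
    groups, current = [], []
    for tok in tokens:
        if tok.endswith(','):           # a trailing comma closes a group
            current.append(tok.rstrip(','))
            groups.append(current)
            current = []
        else:
            current.append(tok)
    if current:                         # last group has no trailing comma
        groups.append(current)
    # Within a group, tokens alternate: weight, path, weight, path, ...
    parsed = []
    for group in groups:
        weights = [float(w) for w in group[0::2]]
        paths = group[1::2]
        parsed.append(list(zip(weights, paths)))
    return parsed

parser = argparse.ArgumentParser()
parser.add_argument('--extra-valid-data-path', nargs='*', default=None,
                    help='Paths to extra validation datasets to be monitored '
                         'during training.')
args = parser.parse_args(
    '--extra-valid-data-path 0.5 en_data, 0.5 fr_data, '
    '0.33 rare1_data 0.33 rare2_data 0.33 rare3_data'.split())
print(parse_extra_valid_data_path(args.extra_valid_data_path))
# -> [[(0.5, 'en_data')], [(0.5, 'fr_data')],
#     [(0.33, 'rare1_data'), (0.33, 'rare2_data'), (0.33, 'rare3_data')]]
```

Splitting on the trailing comma token keeps the shell invocation flat while still recovering the per-set grouping, since the shell attaches each comma to the path token preceding it.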
Any thoughts on a better design?
Work-in-progress PR here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/97