
Fix Bug: Configuring Datasets with train-data-path, valid-data-path, test-data-path

Open. Eisenhower opened this issue 8 months ago • 1 comment

Fixed the bug that prevents configuring datasets using train-data-path, valid-data-path, and test-data-path.

When --split is not set, it falls back to the default value 969, 30, 1. In blended_megatron_dataset_config.py, the following check in __post_init__ then raises an AssertionError whenever datasets are configured with train-data-path, valid-data-path, and test-data-path, because split is not None:

    if self.blend_per_split is not None and any(self.blend_per_split):
        assert self.blend is None, "blend and blend_per_split are incompatible"
        assert self.split is None, "split and blend_per_split are incompatible"
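The failure can be reproduced outside Megatron-LM with a minimal sketch. `ToyDatasetConfig` is a simplified stand-in for the real config class and keeps only the three fields used in the check above. `resolve_split` is a hypothetical helper showing one possible way to avoid the conflict, not necessarily the change made in this fix.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyDatasetConfig:
    """Simplified stand-in for the real config class (assumption, not Megatron-LM code)."""

    blend: Optional[List[str]] = None
    blend_per_split: Optional[List[Optional[List[str]]]] = None
    split: Optional[str] = None

    def __post_init__(self) -> None:
        # Same check as quoted in the issue.
        if self.blend_per_split is not None and any(self.blend_per_split):
            assert self.blend is None, "blend and blend_per_split are incompatible"
            assert self.split is None, "split and blend_per_split are incompatible"


def resolve_split(split: Optional[str], per_split_paths: list) -> Optional[str]:
    """Hypothetical helper: ignore the split string when any per-split path is given."""
    if any(per_split_paths):
        return None
    return split


DEFAULT_SPLIT = "969, 30, 1"  # default value reported in the issue
# Placeholder prefixes standing in for train-data-path, valid-data-path, test-data-path.
per_split = [["train_prefix"], ["valid_prefix"], ["test_prefix"]]

# Passing the default split together with per-split paths trips the second assertion.
try:
    ToyDatasetConfig(blend_per_split=per_split, split=DEFAULT_SPLIT)
    failed = False
    message = ""
except AssertionError as err:
    failed = True
    message = str(err)

# Clearing split first lets the same configuration pass the check.
config = ToyDatasetConfig(
    blend_per_split=per_split,
    split=resolve_split(DEFAULT_SPLIT, per_split),
)
```

In this sketch the default split is kept whenever no per-split path is supplied, so configurations that rely on --split alone behave as before.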

Eisenhower, May 27 '24 11:05