Support Multi-Source Blending in Grain Data Pipeline
Summary: The current Grain-based data loading pipeline in MaxText (configured via `dataset_type: grain` and related flags) primarily supports loading data from a single source path or dataset specification (e.g., `grain_train_files`). This feature request proposes enhancing the existing pipeline to natively support loading, blending, and processing data records from multiple distinct sources simultaneously. This would enable complex data mixtures to be defined and handled efficiently directly within the MaxText training framework, leveraging Grain's underlying capabilities such as `MapDataset.mix`.
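For reference, a minimal sketch of the Grain primitive this request would build on, assuming a recent Grain release where `grain.MapDataset` is part of the public API. The toy record lists stand in for real data sources (e.g., ArrayRecord files); only `MapDataset.source` and `MapDataset.mix` are actual Grain calls here, everything else is illustrative.

```python
import grain

# Two toy corpora standing in for, e.g., web text and code shards.
web = grain.MapDataset.source(
    ["web_0", "web_1", "web_2", "web_3", "web_4", "web_5"]
)
code = grain.MapDataset.source(["code_0", "code_1", "code_2", "code_3"])

# Interleave records from the two sources at a 60/40 sampling ratio.
mixed = grain.MapDataset.mix([web, code], weights=[0.6, 0.4])

print(list(mixed))  # records from both sources, interleaved roughly 3:2
```

The feature request is essentially to expose this mixing capability through MaxText's configuration surface rather than requiring users to wire it up themselves.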
Motivation / Use Case: Creating optimal data mixtures by blending multiple corpora (e.g., web text, books, code) with specific sampling ratios is fundamental for LLM pre-training. Fine-tuning also often requires mixing pre-training data with smaller, specialized datasets.
Currently, users of the MaxText Grain pipeline must perform potentially complex, storage-intensive offline pre-processing to merge diverse datasets into a single input source that MaxText can consume. This approach is inflexible and makes experimenting with different data mixture configurations cumbersome.
Directly supporting multi-source blending within the MaxText pipeline would significantly streamline these common LLM workflows by allowing users to:
- Define complex pre-training data mixtures (e.g., {C4: 60%, GitHub: 20%, Books: 20%}) via MaxText configuration (one possible spec format is sketched after this list).
- Easily blend pre-training and fine-tuning datasets with controlled ratios during fine-tuning runs.
- Manage multilingual datasets more effectively.
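To make the configuration idea concrete, here is one purely hypothetical shape for such a spec, following the existing pattern of string-valued options like `grain_train_files`: a comma-separated list of `path:weight` pairs. Neither this format nor `parse_mixture_spec` exists in MaxText today; this is just a sketch of how it could be parsed.

```python
def parse_mixture_spec(spec: str) -> tuple[list[str], list[float]]:
    """Split 'path:weight,path:weight,...' into parallel path/weight lists.

    Splits on the last ':' of each entry so paths containing colons
    (e.g., gs:// URIs) are handled correctly.
    """
    paths, weights = [], []
    for entry in spec.split(","):
        path, weight = entry.rsplit(":", 1)
        paths.append(path)
        weights.append(float(weight))
    return paths, weights

# Example: a 60/20/20 C4 / GitHub / Books mixture in one flag value.
paths, weights = parse_mixture_spec(
    "gs://data/c4/*.array_record:0.6,"
    "gs://data/github/*.array_record:0.2,"
    "gs://data/books/*.array_record:0.2"
)
assert weights == [0.6, 0.2, 0.2]
```

The parsed paths and weights could then be turned into per-source `MapDataset`s and passed to `MapDataset.mix` as shown above.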
@aireenmei could you please take a look?