Any interest in adding MLPerf Llama 3 8B to TorchTitan models?
It would be great to have MLPerf Llama 3 pre-training working out of the box (OOB) with TorchTitan. Here are some references on that.
@tianyu-l, @wconstab, any thoughts on this? Having such a model OOB would greatly facilitate quick validation and comparison of the full HW and SW stack.
sorry, could you tell a more complete story? Not clear to me what's going on.
@tianyu-l, a quick summary of MLPerf follows.
MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. MLPerf is an industry-standard benchmark suite from MLCommons for measuring the performance of machine learning (ML) hardware, software, and cloud platforms. It provides standardized, apples-to-apples comparisons across different ML tasks, covering both the training of AI models and the inference (or running) of those models on a wide range of systems, from data centers to mobile devices. The benchmarks are a result of a consortium of AI industry leaders, researchers, and academics.
MLPerf Training v5.1 introduces a benchmark based on Meta’s Llama 3.1 8B, replacing BERT with a modern model that can still run on single-node systems. This new pretraining benchmark based on Meta’s Llama 3.1 8B model brings modern LLM pretraining evaluation within reach of more organizations. It was intentionally designed to be easy to set up and run on small to moderately sized computational resources. By requiring only a single node, using a subset of the C4 dataset, and starting from random weights, it lowers the barrier to entry while maintaining relevance to current AI development practices.
@tianyu-l, does that give you enough context about MLPerf? I am only proposing support for the Llama 3 8B MLPerf pre-training case, not any of the other MLPerf cases at this point.
I still don't know what the point of supporting the MLPerf version of Llama 3 8B is, or how it differs from the current Llama 3 8B. Did you mention that anywhere?
@tianyu-l, I did try it out. It works with some changes:
- The C4 en dataset (training and validation) is prepared slightly differently.
- The dataset has to be registered in `text_datasets.py`, as described in datasets.md (a rough sketch follows after this list).
- Log perplexity (target = 3.3) needs to be added as the metric to track accuracy.
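To make the second bullet concrete, here is a minimal sketch of what the registration could look like, assuming the `DatasetConfig`-style pattern described in datasets.md. The class, dict, and helper names below are illustrative placeholders, not torchtitan's actual API, and the shard paths are assumptions:

```python
# Illustrative sketch only; names are stand-ins for whatever text_datasets.py expects.
from dataclasses import dataclass
from typing import Callable

from datasets import load_dataset  # Hugging Face `datasets`


@dataclass
class DatasetConfig:
    # Stand-in for the config object used by the dataset registry.
    path: str
    loader: Callable
    text_processor: Callable


def load_c4_mlperf(dataset_path: str):
    """Stream the merged MLPerf C4 training shards (gzipped JSON lines)."""
    return load_dataset(
        "json",
        data_files=f"{dataset_path}/c4-train.en_*.json.gz",
        split="train",
        streaming=True,
    )


def process_c4_mlperf_text(sample: dict) -> str:
    """C4 JSON lines keep the raw document under the "text" key."""
    return sample["text"]


# The real registration would go into the registry in text_datasets.py,
# keyed by the name referenced from the training .toml config, e.g.:
DATASETS = {
    "c4_mlperf": DatasetConfig(
        path="/path/to/mlperf_c4",  # local directory with the merged shards
        loader=load_c4_mlperf,
        text_processor=process_c4_mlperf_text,
    ),
}
```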
@tianyu-l, please let me know your thoughts. I am planning to do a PR.
IIUC, what you're proposing is to change some things in torchtitan so that the MLPerf Llama experiment can then be run directly with stock torchtitan. Is that right? Well, it depends on what the changes are!
Can you give more information about the proposed changes before you submit a PR? (just to save yourself time in case we don't like the changes).
For example:
> The C4 en dataset (training and validation) is prepared slightly differently.
What is the difference? What is the motivation?
...
@githubsgi I asked what the motivation and benefit are, but I have NOT gotten an answer from you.
@tianyu-l, please see if this makes sense to you. As you pointed out, integrating the MLPerf Llama 3 8B model into TorchTitan would allow running a widely used performance benchmark out of the box (OOB). TorchTitan's hardware-agnostic design enables new accelerators to be onboarded with minimal friction, making it an ideal testbed for emerging hardware. By incorporating the MLPerf Llama 3 8B benchmark, we create an OOB reference point that allows new accelerators to be validated quickly against a well-established industry benchmark.
The benefits extend beyond hardware validation. MLPerf Llama 3 8B submissions provide a comprehensive ecosystem of reference results, detailed logs, convergence metrics, and performance baselines across accelerators from multiple vendors. This wealth of data serves as a crucial calibration tool for a new accelerator, enabling engineers to quickly identify performance gaps, correctness issues, and optimization opportunities.
I think what would help me is to understand the finer-grained proposal and its tradeoffs. For example, you mentioned you want to change something about the dataset. Why, and with what implications? That level of detail for each change would be easier to discuss.
@wconstab, there are essentially two minor differences that I can see as far as data prep goes (see the sketch below):
1. MLCommons provides a download script for fetching the C4 dataset from their site.
2. Another script merges the training set into 8 files (e.g. c4-train.en_0.json.gz) and the validation set into 1 file (e.g. c4-validation.en.json.gz).
There is a third step that converts to Megatron formats; I do not think those formats are supported or relevant here (?). These steps are described here.
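For reference, the merge step boils down to concatenating the downloaded gzipped JSON-lines shards into the fixed file layout above. The sketch below is an illustrative re-implementation, not the actual MLCommons script; the input paths and shard counts are assumptions:

```python
# Illustrative sketch of the MLPerf C4 merge step (not the official MLCommons
# script). The downloaded C4 "en" split ships as many gzipped JSON-lines
# shards; MLPerf regroups them into 8 training files and 1 validation file.
import glob
import gzip
import shutil


def merge_shards(shard_paths: list[str], out_path: str) -> None:
    """Concatenate gzipped JSON-lines shards into a single .json.gz file."""
    with gzip.open(out_path, "wb") as out:
        for shard in shard_paths:
            with gzip.open(shard, "rb") as f:
                shutil.copyfileobj(f, out)


# Training shards -> 8 merged files (c4-train.en_0.json.gz ... _7.json.gz).
train_shards = sorted(glob.glob("c4/en/c4-train.*-of-*.json.gz"))
chunk = len(train_shards) // 8
for i in range(8):
    start = i * chunk
    end = (i + 1) * chunk if i < 7 else len(train_shards)
    merge_shards(train_shards[start:end], f"c4-train.en_{i}.json.gz")

# Validation shards -> a single c4-validation.en.json.gz.
val_shards = sorted(glob.glob("c4/en/c4-validation.*-of-*.json.gz"))
merge_shards(val_shards, "c4-validation.en.json.gz")
```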
@wconstab, any comments?
Also looked into the loss function for Llama 3 8B. It is cross entropy with the natural logarithm, hence perplexity = exp(loss); in other words, log perplexity is just the loss itself, so the MLPerf log perplexity target of 3.3 corresponds directly to a loss of 3.3.
Hence, if the above is correct, then the only thing needed for an MLPerf-compatible run is the data prep!
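A quick sanity check of that mapping, assuming PyTorch's natural-log cross entropy (the 3.3 target is the MLPerf log perplexity mentioned above):

```python
import math

# Cross entropy loss per token is the average negative log-likelihood in nats,
# which is by definition the log perplexity. So the MLPerf target of
# log perplexity = 3.3 needs no conversion to compare against training loss.
target_log_perplexity = 3.3
target_loss = target_log_perplexity             # identical quantity
perplexity = math.exp(target_log_perplexity)    # ~27.1 expressed as raw perplexity
print(f"loss target = {target_loss}, perplexity = {perplexity:.1f}")
```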
@wconstab and @tianyu-l, any comments on the above? Do you see benefit in adding a condensed version of this issue to the README page?