optimum-benchmark
Training benchmarks reproduction
The training benchmark link no longer works: https://huggingface.co/blog/huggingface-and-optimum-amd
How can one test training throughput on AMD these days?
Also, can you provide details about the experiments in the figure below: what context length was used, is this a LoRA, and how can you have ddp=2 with 1xMI250, ...
optimum-benchmark is in constant change; you can find the configs that were used in https://github.com/huggingface/optimum-benchmark/tree/0.0.1/examples/training-llamas
The same goes for inference: there were many good examples, but maintaining them at the pace at which everything in the ecosystem is developing is time consuming, so we removed them for the time being.
- with v0.0.1 you can only run the benchmarks from the CLI, and results will be written to the corresponding folder.
- with main you can write the same benchmark using the Python API and interact with your benchmark configs/reports more freely (see the sketch after this list).
- the ctx length is 256, reported along with all the other benchmarking details here: https://github.com/huggingface/optimum-benchmark/blob/0.0.1/examples/training-llamas/configs/base.yaml#L22
- ddp=2 on an MI250 is possible because one MI250 chip is seen as 2 CUDA devices (explained in the blog post)
- yes, it is a LoRA; that's what the peft keyword means.
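For concreteness, here is a minimal sketch of what a training benchmark looks like through the Python API on main. It assumes the class and field names I've seen in recent versions (TorchrunConfig, TrainingConfig, dataset_shapes, ...), which may differ from whatever you have installed, and the model choice is just an example, so treat it as an outline rather than a drop-in script. With v0.0.1 the equivalent run goes through the hydra CLI instead, something like `optimum-benchmark --config-dir examples/training-llamas/configs --config-name <config>`.

```python
# Minimal sketch of a training benchmark with the Python API (main branch).
# Class/field names are based on my reading of recent versions and may
# differ from the version you have installed -- treat this as an outline.
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    PyTorchConfig,
    TorchrunConfig,
    TrainingConfig,
)
from optimum_benchmark.logging_utils import setup_logging

setup_logging(level="INFO")

if __name__ == "__main__":
    # one MI250 chip is exposed as 2 CUDA devices,
    # so nproc_per_node=2 gives you ddp=2 on a single card
    launcher_config = TorchrunConfig(nproc_per_node=2)

    # sequence_length=256 matches the ctx length from the configs linked above
    scenario_config = TrainingConfig(
        dataset_shapes={"dataset_size": 500, "sequence_length": 256},
        training_arguments={"per_device_train_batch_size": 4},
    )

    # example model; PEFT/LoRA is enabled through the backend config in
    # recent versions (the exact option name may vary between releases)
    backend_config = PyTorchConfig(
        model="meta-llama/Llama-2-7b-hf",
        device="cuda",
        device_ids="0,1",
    )

    benchmark_config = BenchmarkConfig(
        name="llama_training",
        launcher=launcher_config,
        scenario=scenario_config,
        backend=backend_config,
    )

    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark_report.log()
```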
Thanks for the prompt response 😄 I totally understand the need for quick development. Did you try any large-scale training on AMD? I don't know if that's the goal of optimum-benchmark, but it would still be cool to know. I am asking because I am looking for a suitable codebase to benchmark some training on AMD (not LoRA).
@staghado sorry for the late response, I haven't been working on optimum-benchmark lately; you can check the new work in https://huggingface.co/blog/huggingface-amd-mi300. The goal of optimum-benchmark is to let you easily get metrics like training throughput and memory consumption, and to tell whether a given training run is possible at all, quickly and without needing to set up the data + training pipeline. You can also compare different configs and find the one that your machine can handle, or that best matches the topology of your machines (like which tp/dp degree to use); the sketch below shows the idea.
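To make the "compare configs" point concrete, here is a hedged sketch of sweeping one knob (per-device batch size) and using the reports to find what the machine can handle. The report keys and the assumption that an impossible run (e.g. OOM) surfaces as an exception are my reading of the library, not a guaranteed API; dump `report.to_dict()` once to see what your version actually records.

```python
# Hypothetical sweep: launch the same training benchmark with different
# per-device batch sizes and compare the resulting reports.
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    PyTorchConfig,
    TorchrunConfig,
    TrainingConfig,
)

for batch_size in (1, 2, 4, 8):
    config = BenchmarkConfig(
        name=f"llama_training_bs{batch_size}",
        launcher=TorchrunConfig(nproc_per_node=2),  # ddp=2 on one MI250
        scenario=TrainingConfig(
            dataset_shapes={"dataset_size": 500, "sequence_length": 256},
            training_arguments={"per_device_train_batch_size": batch_size},
        ),
        backend=PyTorchConfig(
            model="meta-llama/Llama-2-7b-hf", device="cuda", device_ids="0,1"
        ),
    )
    try:
        report = Benchmark.launch(config)
        # pick out training throughput/memory from the report;
        # the exact keys depend on the version, so dump the whole dict
        print(batch_size, report.to_dict())
    except Exception as error:
        # "whether the training is possible": OOMs and the like land here
        print(f"bs={batch_size} failed: {error}")
```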