Stas Bekman
I found time to write a simple comparison repro and the outcomes differ very slightly:

```
$ deepspeed --num_gpus 1 train-amp-vs-deepspeed.py
[...]
deepspeed loss.item()=12.141949653625488
torch.amp loss.item()=12.1455078125
```

The script: ```...
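The actual script is truncated above; the following is only a minimal sketch of what such a comparison could look like (the tiny model, shapes, and DeepSpeed config are illustrative stand-ins, not the original script). It assumes a launch via `deepspeed --num_gpus 1` so DeepSpeed can set up its distributed state:

```python
import torch
import deepspeed

loss_fn = torch.nn.MSELoss()

def make_model():
    # tiny stand-in model; the real repro presumably used something larger
    torch.manual_seed(0)  # identical init weights for both paths
    return torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
    ).cuda()

x = torch.randn(16, 1024, device="cuda")
y = torch.randn(16, 1024, device="cuda")

# --- torch.amp path: fp32 weights, autocast runs the forward in fp16 ---
model = make_model()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)
print(f"torch.amp loss.item()={loss_fn(out.float(), y).item()}")

# --- deepspeed path: fp16 enabled via config (weights are cast to half) ---
model = make_model()
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
out = engine(x.half())  # fp16 engine expects half-precision inputs
print(f"deepspeed loss.item()={loss_fn(out.float(), y).item()}")
```

One plausible source of the small numeric gap: DeepSpeed's fp16 mode casts the model weights to half precision up front, while `torch.autocast` keeps the weights in fp32 and only lowers the precision of selected ops during the forward.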
Actually, there is another problem here. If I install `mpi4py` from PyPI:

```
pip install "mpi4py>=4.0.0"
```

trt-llm fails to load:

```
python -c "import tensorrt_llm"
*** An error occurred...
```
After debugging this - as I'm running in a SLURM env, it appears to be triggered by `SLURM_NODELIST` being present - the PyPI version of `mpi4py` relies on a system-wide MPI...
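One hypothetical way to confirm this diagnosis (assuming the failure really is keyed off that environment variable) is to drop `SLURM_NODELIST` before the import and see whether the import then succeeds:

```python
# Hypothetical check: remove SLURM_NODELIST from the environment before
# tensorrt_llm (and, through it, mpi4py) gets imported, then retry the import.
# This only tests the hypothesis; it is not a proper fix.
import os

os.environ.pop("SLURM_NODELIST", None)

import tensorrt_llm  # noqa: E402  (deliberately imported after the env tweak)

print(tensorrt_llm.__version__)
```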
Thank you, @frankschae! But the 2nd result is worse than the first one. H* performs the best at bigger dimensions - please see the H100 entry at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator/benchmarks#examples-of-usage - surely...
Much better - H200 has a faster HBM, so we should expect higher matmul TFLOPS. When finished, if it resonates, please make a PR to add a new entry at...
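For reference, here is a back-of-the-envelope sketch of the kind of measurement mamf-finder.py performs far more thoroughly (the shapes, dtype, and iteration counts below are arbitrary examples):

```python
# Rough achieved-matmul-TFLOPS measurement: time repeated matmuls with CUDA
# events and divide the 2*M*N*K FLOPs by the average per-iteration time.
import torch

m = n = k = 8192  # arbitrary example shape
a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")

# warmup so the timed loop doesn't include kernel compilation/caching costs
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

secs = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms
tflops = 2 * m * n * k / secs / 1e12
print(f"{tflops:.1f} TFLOPS at {m=} {n=} {k=}")
```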
> I'm not opening another PR as there is already @yaolu's PR #60.

I added you to @yaolu's PR :)

> Btw, @stas00 do you want to change https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/benchmarks/mamf-finder.py#L276C21-L276C48 so...
resolved in https://github.com/stas00/ml-engineering/pull/60
Thank you for starting the discussion, Junjie. This is indeed something we need to solve. I'm not quite sure DeviceMesh is the thing to standardize on, since: 1. you have...
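For context, here is a minimal sketch of the DeviceMesh API under discussion (PyTorch 2.2+; the 2x4 layout and the "dp"/"tp" dimension names are arbitrary examples, and this assumes a torchrun launch with 8 ranks):

```python
# DeviceMesh builds a named, N-dimensional layout of ranks; each named
# dimension maps onto a process group.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

tp_group = mesh["tp"].get_group()  # this rank's tensor-parallel group
dp_group = mesh["dp"].get_group()  # this rank's data-parallel group
print(dist.get_rank(), dist.get_world_size(group=tp_group))
```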
> mpu provides a series of set_xx APIs for setting world sizes or ranks without touching any created groups.

I was wondering what their primary use cases are. Probably to...
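To make the question concrete, here is a hypothetical sketch of the kind of setter-only mpu API being described (loosely modeled on Megatron-style parallel state; no process groups are created, which is why the use case is non-obvious):

```python
# Hypothetical mpu-style setters/getters that only record parallel sizes and
# ranks as module-level state, without creating any torch.distributed groups -
# e.g. conceivably useful for single-process unit tests or dry-run tooling.
_TP_WORLD_SIZE = None
_TP_RANK = None

def set_tensor_model_parallel_world_size(world_size: int) -> None:
    global _TP_WORLD_SIZE
    _TP_WORLD_SIZE = world_size

def set_tensor_model_parallel_rank(rank: int) -> None:
    global _TP_RANK
    _TP_RANK = rank

def get_tensor_model_parallel_world_size() -> int:
    assert _TP_WORLD_SIZE is not None, "world size was never set"
    return _TP_WORLD_SIZE

def get_tensor_model_parallel_rank() -> int:
    assert _TP_RANK is not None, "rank was never set"
    return _TP_RANK
```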
Also remember that DeepSpeed creates the default group across all GPUs, but usually doesn't use it, as it then creates new groups which it does use - this probably wastes...
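For illustration, the pattern being described looks roughly like this (the subgroup size of 2 is an arbitrary example; run under torchrun):

```python
# init_process_group creates the default group spanning every rank; new_group
# then carves out the smaller groups that actually get used for collectives.
# Note that every rank must call new_group for every subgroup, in the same
# order, even for subgroups it is not a member of.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # default group: all ranks
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

my_group = None
for start in range(0, world_size, 2):
    ranks = list(range(start, min(start + 2, world_size)))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        my_group = group

# collectives then run on my_group, while the default group sits unused
```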