Stas Bekman
I found time to write a simple comparison repro and the outcomes differ very slightly:

```
$ deepspeed --num_gpus 1 train-amp-vs-deepspeed.py
[...]
deepspeed loss.item()=12.141949653625488
torch.amp loss.item()=12.1455078125
```

The script: ```...
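The actual script is truncated above; the following is only a minimal sketch of what such a comparison could look like (the tiny model, shapes, and DeepSpeed config are illustrative stand-ins, not the original script). It assumes a launch via `deepspeed --num_gpus 1` so DeepSpeed can set up its distributed state:

```python
import torch
import deepspeed

loss_fn = torch.nn.MSELoss()

def make_model():
    # tiny stand-in model; the real repro presumably used something larger
    torch.manual_seed(0)  # identical init weights for both paths
    return torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
    ).cuda()

x = torch.randn(16, 1024, device="cuda")
y = torch.randn(16, 1024, device="cuda")

# --- torch.amp path: fp32 weights, autocast runs the forward in fp16 ---
model = make_model()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)
print(f"torch.amp loss.item()={loss_fn(out.float(), y).item()}")

# --- deepspeed path: fp16 enabled via config (weights are cast to half) ---
model = make_model()
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
out = engine(x.half())  # fp16 engine expects half-precision inputs
print(f"deepspeed loss.item()={loss_fn(out.float(), y).item()}")
```

One plausible source of the small numeric gap: DeepSpeed's fp16 mode casts the model weights to half precision up front, while `torch.autocast` keeps the weights in fp32 and only lowers the precision of selected ops during the forward.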
Actually, there is another problem here. If I install `mpi4py` from PyPI:

```
pip install "mpi4py>=4.0.0"
```

trt-llm fails to load:

```
python -c "import tensorrt_llm"
*** An error occurred...
```
After debugging this - as I'm running in a SLURM env, it appears to be triggered by `SLURM_NODELIST` being present - the PyPI version of `mpi4py` relies on a system-wide MPI...
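One hypothetical way to confirm this diagnosis (assuming the failure really is keyed off that environment variable) is to drop `SLURM_NODELIST` before the import and see whether the import then succeeds:

```python
# Hypothetical check: remove SLURM_NODELIST from the environment before
# tensorrt_llm (and, through it, mpi4py) gets imported, then retry the import.
# This only tests the hypothesis; it is not a proper fix.
import os

os.environ.pop("SLURM_NODELIST", None)

import tensorrt_llm  # noqa: E402  (deliberately imported after the env tweak)

print(tensorrt_llm.__version__)
```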
Thank you, @frankschae! But the 2nd result is worse than the first one. H* performs the best at bigger dimensions - please see the H100 entry at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator/benchmarks#examples-of-usage - surely...
Much better - H200 has a faster HBM, so we should expect higher matmul TFLOPS. When finished, if it resonates, please make a PR to add a new entry at...
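For reference, here is a back-of-the-envelope sketch of the kind of measurement mamf-finder.py performs far more thoroughly (the shapes, dtype, and iteration counts below are arbitrary examples):

```python
# Rough achieved-matmul-TFLOPS measurement: time repeated matmuls with CUDA
# events and divide the 2*M*N*K FLOPs by the average per-iteration time.
import torch

m = n = k = 8192  # arbitrary example shape
a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")

# warmup so the timed loop doesn't include kernel compilation/caching costs
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

secs = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms
tflops = 2 * m * n * k / secs / 1e12
print(f"{tflops:.1f} TFLOPS at {m=} {n=} {k=}")
```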
> I'm not opening another PR as there is already @yaolu's PR #60.

I added you to @yaolu's PR :)

> Btw, @stas00 do you want to change https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/benchmarks/mamf-finder.py#L276C21-L276C48 so...
resolved in https://github.com/stas00/ml-engineering/pull/60
Thank you for starting the discussion, Junjie. This is indeed something we need to solve. I'm not quite sure DeviceMesh is the thing to standardize on, since: 1. you have...
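For context, here is a minimal sketch of the DeviceMesh API under discussion (PyTorch 2.2+; the 2x4 layout and the "dp"/"tp" dimension names are arbitrary examples, and this assumes a torchrun launch with 8 ranks):

```python
# DeviceMesh builds a named, N-dimensional layout of ranks; each named
# dimension maps onto a process group.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

tp_group = mesh["tp"].get_group()  # this rank's tensor-parallel group
dp_group = mesh["dp"].get_group()  # this rank's data-parallel group
print(dist.get_rank(), dist.get_world_size(group=tp_group))
```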
> mpu provides a series of set_xx APIs for setting world sizes or ranks without touching any created groups.

I was wondering what their primary use cases are. Probably to...
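To make the question concrete, here is a hypothetical sketch of the kind of setter-only mpu API being described (loosely modeled on Megatron-style parallel state; no process groups are created, which is why the use case is non-obvious):

```python
# Hypothetical mpu-style setters/getters that only record parallel sizes and
# ranks as module-level state, without creating any torch.distributed groups -
# e.g. conceivably useful for single-process unit tests or dry-run tooling.
_TP_WORLD_SIZE = None
_TP_RANK = None

def set_tensor_model_parallel_world_size(world_size: int) -> None:
    global _TP_WORLD_SIZE
    _TP_WORLD_SIZE = world_size

def set_tensor_model_parallel_rank(rank: int) -> None:
    global _TP_RANK
    _TP_RANK = rank

def get_tensor_model_parallel_world_size() -> int:
    assert _TP_WORLD_SIZE is not None, "world size was never set"
    return _TP_WORLD_SIZE

def get_tensor_model_parallel_rank() -> int:
    assert _TP_RANK is not None, "rank was never set"
    return _TP_RANK
```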
Also remember that DeepSpeed creates the default group across all GPUs, but usually doesn't use it, as it then creates new groups which it does use - this probably wastes...
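For illustration, the pattern being described looks roughly like this (the subgroup size of 2 is an arbitrary example; run under torchrun):

```python
# init_process_group creates the default group spanning every rank; new_group
# then carves out the smaller groups that actually get used for collectives.
# Note that every rank must call new_group for every subgroup, in the same
# order, even for subgroups it is not a member of.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # default group: all ranks
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

my_group = None
for start in range(0, world_size, 2):
    ranks = list(range(start, min(start + 2, world_size)))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        my_group = group

# collectives then run on my_group, while the default group sits unused
```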