Logan Adams
> @loadams @mrwyattii @lekurile can this change be merged?

Sorry for the delay, @nelyahu - I will make sure this is merged shortly.
@Chandler-Bing - are you able to test with any versions in between 0.10.2 and 0.13.5? Could you test with 0.11.x or 0.12.x?
@Chandler-Bing - thanks, the changelog between 0.12.3 and 0.12.4 is fairly small: https://github.com/microsoft/DeepSpeed/compare/v0.12.3...v0.12.4 I'll have to take a closer look, but if you're able to easily git bisect/binary search those...
> Hello, I have the same issue. When training on the A100, everything operates normally during MLLM stage1. However, during stage2, ds 0.9.5 functions correctly, but version 0.14.0 does not....
Hi, can you please add a title?
> Hi @XuehaiPan - thank you for the contribution. If I recall correctly, we had to use `pynvml` because we were getting inaccurate memory information from `torch` in some scenarios....
@mrwyattii and @cmikeh2 - do we expect these to work on AMD and thoughts on the change?
@rraminen - AMD isn't currently supported in FastGen, so does it make sense to merge this PR with later support for when that comes in? Since for now, this won't...
FYI - @rraminen and @jithunnair-amd. @Hobbes-Le-Chat - could you let us know which version of DeepSpeed you are using and share your `ds_report` output?
@Hobbes-Le-Chat - does this mean that you do not see the error when you remove `DS_BUILD_RANDOM_LTD=1`? And if you can just copy/paste the whole output from your console, that should...