Olatunji Ruwase comments

Results: 648 comments of Olatunji Ruwase

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

> However, with that change I get the following error on the `self.actor_model.empty_partition_cache()` call:
>
> ```
> File "/path/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 358, in release_and_reset_all
> raise RuntimeError(f"param {param.ds_summary()} still in...

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

> Should it be `lora_param` instead of `lora_params`? Maybe change this to:
>
> ```
> if len(lora_param) == 3:
> ```

I think you have found a bug here....

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

Thanks for sharing these details. I agree that `empty_partition_cache` needs a `wait_on_inflight_params` logic like you discovered. However, I would like to take a step back to understand a few things....
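
To make the point about waiting before emptying the cache concrete, here is a minimal toy sketch. It is not DeepSpeed code: `ToyParam`, `ToyCoordinator`, the status strings, and the error message are all hypothetical, and only the method names are borrowed from this thread. It assumes that releasing a parameter whose fetch is still in flight is an error, so the cache has to wait on in-flight fetches first.

```python
class ToyParam:
    """Hypothetical stand-in for a partitioned parameter (not a DeepSpeed class)."""

    def __init__(self, name):
        self.name = name
        # Assumed states for illustration: NOT_AVAILABLE, INFLIGHT, AVAILABLE
        self.status = "NOT_AVAILABLE"


class ToyCoordinator:
    """Toy model of a parameter cache with asynchronous prefetch."""

    def __init__(self):
        self.params = []

    def prefetch(self, param):
        # Start an (imaginary) asynchronous fetch and track the parameter.
        param.status = "INFLIGHT"
        self.params.append(param)

    def wait_on_inflight_params(self):
        # Block until every pending fetch has completed.
        for p in self.params:
            if p.status == "INFLIGHT":
                p.status = "AVAILABLE"

    def release_and_reset_all(self):
        # Releasing a parameter whose fetch has not finished is an error.
        for p in self.params:
            if p.status == "INFLIGHT":
                raise RuntimeError(f"cannot release {p.name}: fetch not finished")
            p.status = "NOT_AVAILABLE"
        self.params.clear()

    def empty_partition_cache(self, wait_first=True):
        # The point under discussion: wait on in-flight fetches before releasing.
        if wait_first:
            self.wait_on_inflight_params()
        self.release_and_reset_all()
```

In this toy, calling `empty_partition_cache(wait_first=False)` right after a `prefetch` raises, while the default path waits first and then releases cleanly.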

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

@adammoody, kudos on the intensive debugging. I think I know what might be wrong, but I need your help to confirm. I have updated my PR with some asserts to...

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

Thanks for sharing these updates. Adding the second assert for the actor model cache is a really good idea. It is a mystery why it fails. This supports your suspicion of...

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

Also, can you try dropping `--enable_hybrid_engine` from your command line?

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

Thanks for the update.

1. Hitting the assertion is good since it stops at the earliest violation of the invariant of `empty_partition_cache()`.
2. It is good to know that `--enable_hybrid_engine`...

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

@adammoody, FYI I think this [DeepSpeed PR](https://github.com/microsoft/DeepSpeed/pull/3380) from my colleague @HeyangQin might be relevant here. Please give him a bit more time to get it ready.

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

> @tjruwase, I think I found the cause.
>
> I believe the problem is that all four models share the same ReLU module object. Each model registers a...
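
The sharing described in the quote can be illustrated with a small pure-Python sketch. The quoted comment is truncated, so what each model actually registers is not shown here. The sketch uses a generic list of registrations as a placeholder, and `ToyModule`, `ToyModel`, `build_models`, and the four model names are hypothetical illustrations rather than DeepSpeed or PyTorch APIs.

```python
class ToyModule:
    """Toy stand-in for an activation module object (not torch.nn.ReLU)."""

    def __init__(self):
        # Placeholder for anything an owning model attaches to this object.
        self.registrations = []


class ToyModel:
    def __init__(self, name, activation):
        self.name = name
        self.activation = activation
        # Each model attaches its own per-model state to the module it was given.
        activation.registrations.append(name)


def build_models(share):
    """Build four toy models that share one activation object, or get one each."""
    names = ("actor", "reference", "critic", "reward")  # illustrative names
    shared = ToyModule()
    return [ToyModel(n, shared if share else ToyModule()) for n in names]


shared_models = build_models(share=True)
separate_models = build_models(share=False)

# With sharing, every model's registration lands on the same object.
print(len(shared_models[0].activation.registrations))
# Without sharing, each module carries only its own model's registration.
print(len(separate_models[0].activation.registrations))
```

When the object is shared, state attached on behalf of one model is visible through every other model that holds the same object, which is the kind of cross-model interference the quote suspects.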

single gpu 6.7b lora CUDA OOM with A6000 48G

Please try adding `--offload_reference_model` to your command line.