Olatunji Ruwase
> However, with that change I get the following error on the `self.actor_model.empty_partition_cache()` call: > > ``` > File "/path/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 358, in release_and_reset_all > raise RuntimeError(f"param {param.ds_summary()} still in...
> Should it be `lora_param` instead of `lora_params`? Maybe change this to:
>
> ```
> if len(lora_param) == 3:
> ```

I think you have found a bug here....
Thanks for sharing these details. I agree that `empty_partition_cache` needs a `wait_on_inflight_params` logic like you discovered. However, I would like to take a step back to understand a few things....
@adammoody, kudos on the intensive debugging. I think I know what might be wrong, but I need your help to confirm. I have updated my PR with some asserts to...
Thanks for sharing these updates. Adding the second assert for the actor model cache is a really good idea. It is a mystery why it fails. This supports your suspicion of...
Also, can you try dropping `--enable_hybrid_engine` from your command line?
Thanks for the update. 1. Hitting the assertion is good since it stops at the earliest violation of the invariant of `empty_partition_cache()`. 2. It is good to know that `--enable_hybrid_engine`...
@adammoody, FYI I think this [DeepSpeed PR](https://github.com/microsoft/DeepSpeed/pull/3380) from my colleague @HeyangQin might be relevant here. Please give him a bit more time to get it ready.
> @tjruwase, I think I found the cause.
>
> I believe the problem is that all four models share the same ReLU module object. Each model registers a...
Please try adding `--offload_reference_model` to your command line.