Sylvain Gugger
cc @muellerzr Could you have a look into it?
> but, even the naive Data Parallel with AllReduce, shouldn't it be like this?

`device_map="auto"` is not data parallelism, it's model parallelism (your model is split across the GPUs). It is...
Not sure what the issue is here. You have a new tensor with no corresponding weight in the checkpoint, so it does not work.
> this new tensor needs to be initialized randomly during training

So make sure you properly initialize that weight in the `_init_weights` function of your custom model.
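For reference, a minimal sketch of the initialization pattern, using plain PyTorch to mimic what `_init_weights` typically does in a `PreTrainedModel` subclass (the class and value names here are illustrative, not from this thread):

```python
import torch
import torch.nn as nn

# Illustrative module standing in for a custom head with a new tensor.
class MyHead(nn.Module):
    def __init__(self, hidden=16, labels=4):
        super().__init__()
        self.classifier = nn.Linear(hidden, labels)

def init_weights(module, initializer_range=0.02):
    # Same recipe most transformers models use in _init_weights:
    # normally distributed weights, zero biases.
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()

head = MyHead()
head.apply(init_weights)
print(torch.all(head.classifier.bias == 0).item())  # True
```

In a real custom model you would put the body of `init_weights` inside your model's `_init_weights(self, module)` so `from_pretrained` can initialize any weight missing from the checkpoint.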
You also cannot call `to(xxx)` or `cuda()` on a model loaded with `device_map='auto'`. The model will already be loaded on the GPUs you have available.
Why does it not work? You have enough room to fit your whole model on the GPU, so this is what `infer_auto_device_map` does.
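A rough sketch of the logic: `infer_auto_device_map` effectively places each layer on the first device whose remaining memory budget can hold it, so if one GPU fits everything, everything lands there. The layer sizes and budgets below are made-up numbers for illustration, not the real accelerate algorithm:

```python
# Greedy placement sketch: assign each layer to the first device with room.
def naive_device_map(layer_sizes, budgets):
    device_map, remaining = {}, dict(budgets)
    for name, size in layer_sizes.items():
        for device, free in remaining.items():
            if size <= free:
                device_map[name] = device
                remaining[device] -= size
                break
        else:
            device_map[name] = "cpu"  # offload when nothing fits
    return device_map

layers = {"embed": 4, "block0": 3, "block1": 3, "head": 2}
# A single GPU with budget 12 holds the whole model, so no splitting happens.
print(naive_device_map(layers, {0: 12}))
# {'embed': 0, 'block0': 0, 'block1': 0, 'head': 0}
```

With two smaller budgets (e.g. `{0: 6, 1: 6}`) the same function would spread the layers across devices, which is the model-parallel split `device_map="auto"` produces when one GPU is not enough.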
To train your model using data parallelism and model parallelism, you need to use DeepSpeed or FSDP.
I am not able to reproduce any of the bugs you mention, can you try installing from source?
You forget that one parameter takes 4 bytes of space. With 1000*1000*6 set as max space, you cannot fit your whole model. Also it needs to make sure you will...