Sylvain Gugger
I can't reproduce on my side as the prompt yielded in the batches makes Accelerate fail: since this is an iterable dataset, `dispatch_batches` is activated by default and the...
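For context on the default mentioned above, dispatching can be toggled explicitly. A minimal sketch, assuming the `dispatch_batches` argument of `Accelerator` available in Accelerate versions of that era (the `MyStream` dataset is a made-up placeholder):

```python
from accelerate import Accelerator
from torch.utils.data import DataLoader, IterableDataset

# For an IterableDataset, Accelerate turns dispatch_batches on by default:
# batches are built on the main process and broadcast to the others.
# Passing dispatch_batches=False makes each process iterate its own copy.
accelerator = Accelerator(dispatch_batches=False)

class MyStream(IterableDataset):  # placeholder streaming dataset
    def __iter__(self):
        yield from range(8)

loader = accelerator.prepare(DataLoader(MyStream(), batch_size=2))
```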
Could you share a minimal sample of code reproducing the error please?
@muellerzr the problem is in the forward though ;-) And it should work for training as long as there is no offload.
@ananda1996ai First note that you cannot use data parallel in conjunction with model parallelism, so `num_processes` in your config needs to be 1. I cannot reproduce the error, could you...
Oh the problem is quite clear then, the process only sees GPU 7. I think it all stems from the fact that you use `num_processes=2` in your accelerate config.
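As an illustration of the fix suggested above, the relevant fields in the YAML written by `accelerate config` (typically under `~/.cache/huggingface/accelerate/`) would look roughly like this; the surrounding fields are assumptions, only `num_processes` is the point:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: NO   # no data parallelism when splitting one model across GPUs
num_processes: 1       # model parallelism needs a single process
```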
Yes it does. Since you're not describing the problem you encountered, I'm not sure how we can help. You still need to have enough RAM to load the checkpoint shards...
Yes but it will be very slow unless you have a very fast hard drive. You will also need to limit the RAM used by the first models (since Accelerate...
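A minimal sketch of limiting the memory used by the first model, via the `max_memory` and `offload_folder` arguments of `from_pretrained`; the checkpoint name and the memory caps here are placeholder assumptions, not values from the thread:

```python
from transformers import AutoModelForCausalLM

# Cap how much memory this model may claim on each device so later models
# still fit; weights beyond the caps are offloaded to disk (slow on HDDs).
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-large-model",              # placeholder checkpoint
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "20GiB"},  # per-device caps (assumed values)
    offload_folder="offload",                 # directory for disk offload
)
```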
That's a problem between PyTorch and xformers. You should report the issue on their repos :-)