stanford_alpaca

Errors executing torchrun for train.py on Apple Silicon M2 Pro

Open droptpackets opened this issue 2 years ago • 0 comments

When attempting to fine-tune, I get the following error in the output: RuntimeError: Distributed package doesn't have NCCL built in

Searching here indicates this is related to CUDA and NVIDIA GPU support (NCCL is NVIDIA's collective communication library, which is not available on macOS builds of PyTorch).

So I added the following line to train.py, which is supposed to force CPU-only operation by using the gloo backend (the same workaround used by this user in another Meta-related repo: https://github.com/markasoftware/llama-cpu): torch.distributed.init_process_group("gloo")
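For illustration, here is a minimal sketch of choosing the backend at runtime rather than hard-coding it. The helper names `choose_backend` and `init_distributed` are hypothetical and not part of train.py; the sketch assumes the script is launched through torchrun, so the rendezvous environment variables are already set.

```python
def choose_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Return "nccl" only when both CUDA and NCCL are usable, else "gloo"."""
    return "nccl" if (cuda_available and nccl_available) else "gloo"


def init_distributed() -> str:
    """Hypothetical helper: initialise the process group with a usable backend."""
    # Imported lazily so choose_backend can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    backend = choose_backend(torch.cuda.is_available(), dist.is_nccl_available())
    if not dist.is_initialized():
        # Relies on the env:// rendezvous variables that torchrun exports.
        dist.init_process_group(backend=backend)
    return backend
```

On a machine without CUDA this would pick gloo, which matches the manual workaround above. It only addresses the NCCL error, not any later CUDA-specific calls made elsewhere.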

Now the NCCL error goes away, but I get this error instead: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
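This second error usually means that something called torch.cuda.set_device on a PyTorch build compiled without CUDA. Where exactly that happens in this run is an assumption; it may be inside a library rather than train.py itself. Below is a hedged sketch of guarding device selection. The names `pick_device_name` and `select_device` are hypothetical, and falling back to MPS is a suggestion rather than something the repo does.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def select_device(local_rank: int = 0):
    """Hypothetical helper: return a torch.device without touching CUDA APIs
    unless CUDA is actually available."""
    import torch  # lazy import keeps pick_device_name usable on its own

    # torch.backends.mps only exists in newer PyTorch releases, hence the hasattr check.
    mps_ok = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    name = pick_device_name(torch.cuda.is_available(), mps_ok)
    if name == "cuda":
        torch.cuda.set_device(local_rank)  # only reached on CUDA builds
        return torch.device("cuda", local_rank)
    return torch.device(name)
```

If the call originates inside a library's argument setup, a guard like this in train.py would not be enough on its own; the library would need to be told not to use CUDA.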

Again, I have no NVIDIA GPU or related software on my system. I've tried a myriad of workarounds for Apple Silicon, but haven't gotten very far.

Is there anything I'm missing here?

I'm running Python 3.10.9 with all requirements.txt entries installed via pip.

droptpackets, Mar 26 '23 05:03