Sean Owen

245 comments by Sean Owen

I don't think you necessarily have to set that. Are you seeing an issue? Otherwise you typically set this to the name of the repeated module in the transformer architecture....

You ran out of disk space, that's all. Where did you save stuff? Sometimes your local root volume is small, and most of the storage is in mounted EBS volumes.
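A quick way to see which volume is actually full is to check free space per mount point. This is a minimal sketch using only the Python standard library; the paths listed are hypothetical placeholders, not paths from the thread, so swap in wherever checkpoints and caches are really written.

```python
import shutil


def free_space_gb(paths):
    """Return free space in GiB for each path that exists on this machine."""
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            # Path is missing or unreadable on this machine; skip it.
            continue
        report[path] = usage.free / 2**30
    return report


# Hypothetical mount points; replace with the real output and cache locations.
for path, free in free_space_gb(["/", "/tmp", "/mnt/data"]).items():
    print(f"{path}: {free:.1f} GiB free")
```

If the root volume shows little free space while a mounted volume has plenty, pointing the output directory at the mounted volume is usually the fix.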

These models are far too large to run reasonably on CPUs, yes. You need an NVIDIA GPU, so this won't work on Macs. See https://github.com/databrickslabs/dolly/issues/67 for some attempts to get...

See guidance here: https://github.com/databrickslabs/dolly#v100-gpus-1 p3dn.24xlarge (32GB V100s) would be better. That's a 16GB V100. You may be able to make it work by configuring optimizer offload _and_ turning down batch...

16GB is small for training, yeah. You can try param offload too. But then it'll be slower. You want bigger GPUs - maybe g5 / A10?
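For reference, here is roughly what optimizer and param offload look like, assuming training runs through DeepSpeed with ZeRO stage 3 (an assumption on my part; the comment does not name the framework). The keys are standard DeepSpeed config keys, but the helper name `with_cpu_offload` and the batch size value are illustrative only, not recommended settings.

```python
import copy
import json


def with_cpu_offload(config, offload_params=True, micro_batch_size=None):
    """Return a copy of a DeepSpeed-style config with CPU offload enabled.

    Hypothetical helper for illustration; not part of any library.
    """
    cfg = copy.deepcopy(config)
    zero = cfg.setdefault("zero_optimization", {})
    # Param offload requires ZeRO stage 3.
    zero["stage"] = 3
    zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_params:
        # Saves more GPU memory, at the cost of slower training.
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    if micro_batch_size is not None:
        cfg["train_micro_batch_size_per_gpu"] = micro_batch_size
    return cfg


base = {"bf16": {"enabled": True}}
print(json.dumps(with_cpu_offload(base, micro_batch_size=1), indent=2))
```

Offloading trades GPU memory for host memory and PCIe traffic, which is why it runs slower than keeping everything on a larger GPU.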

Never seen that one - are there other errors? This is just saying "something went wrong". Make sure you made all the settings in the notebook, and the suggested configurations. It...

OOM on the GPU or VM? You are following https://github.com/databrickslabs/dolly#a10-gpus-1 right? That instance was enough IIRC to train 7B, or at least it started to. If it fails late in...

Oh, you are trying to train on a g5.4xlarge? I misread, crossed it with the OP's thread. That's too little mem. You want a multi-GPU setup with more mem like...

Yep, doesn't look right. Not sure how you have set up the instance, so pretty hard to debug. Probably mismatched versions of CUDA libraries.

I think the problem is very long inputs. You can filter them or use a smaller batch size, indeed. I'll have to change guidance if 3 isn't working. Yeah, you...
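Filtering out very long inputs can be sketched as below. This is only an illustration: `filter_long_examples` is a hypothetical helper, the whitespace split stands in for the model's real tokenizer, and the 1024-token limit is an arbitrary placeholder rather than a value from the thread.

```python
def filter_long_examples(examples, tokenize, max_tokens):
    """Keep examples whose token count is at most max_tokens.

    Returns the kept examples and the number that were dropped.
    """
    kept = [ex for ex in examples if len(tokenize(ex)) <= max_tokens]
    return kept, len(examples) - len(kept)


# Stand-in data and tokenizer; use the real dataset and tokenizer in practice.
examples = ["short prompt", "word " * 5000]
kept, dropped = filter_long_examples(examples, str.split, max_tokens=1024)
print(f"kept {len(kept)} examples, dropped {dropped}")
```

Dropping the few longest examples caps the peak memory of a batch, which is often enough to avoid lowering the batch size for the whole run.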