
Anyone able to run 7B on google colab?

Open · andrewmlu opened this issue 2 years ago · 9 comments

Interested to see if anyone has been able to run 7B on Google Colab. It seems like 16 GB should be enough, and the free tier often grants that much. Not sure whether Colab Pro does any better, but if anyone has managed it, advice would be much appreciated.

andrewmlu avatar Mar 05 '23 03:03 andrewmlu

Not for free users. The model must be loaded into CPU memory (if we are talking about this repo), but free Colab only provides less than 13 GB of RAM. Therefore, it cannot be run on the free version of Colab.

However, it may work if you have a Pro subscription. I look forward to hearing about your outcome.

reycn avatar Mar 05 '23 07:03 reycn

What about loading it onto the TPU? KoboldAI can load 20B models (like Erebus 20B) onto a TPU just fine, although it takes about 15 minutes to load a ~40 GB model; Erebus specifically is split into 23 parts, and once running it's pretty fast. Maybe there's a way to reduce the per-segment size from ~15 GB down to 2 GB by splitting the checkpoint into smaller parts, the same way Erebus is split.

Daviljoe193 avatar Mar 05 '23 08:03 Daviljoe193

I got it to run on a Shadow PC (#105), which has only 12 GB of RAM. It crunched the page file a fair bit, but still loaded the model in about 110 seconds, since the RAM is freed again once the weights have moved to the GPU.

So it can work on a computer with less than 14 GB of RAM; perhaps Google Colab just doesn't have a page file? IDK.

elephantpanda avatar Mar 05 '23 09:03 elephantpanda

It's possible to make it work on the free version. Since colab gives you more GPU VRAM than RAM, what you'll want to do is load the checkpoint into CUDA rather than CPU. Once you've done that, split the state dict on the layers, save the sharded state dict, and then after freeing your GPU memory (or in another run) sequentially load each shard into the model on the GPU afterward, making sure to delete each shard once you're done. You'll save quite a bit of RAM during the loading process, and from there it should work.
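Roughly, the idea looks like this. This is only a minimal sketch, assuming a single consolidated.00.pth checkpoint, illustrative shard file names, and a `model` that has already been built on the GPU from this repo's code; it is not the exact notebook.

```python
# Minimal sketch of the sharding trick described above; not the repo's official loader.
# Checkpoint path and shard naming are illustrative.
import gc
import glob
import torch


def shard_checkpoint(ckpt_path: str = "consolidated.00.pth") -> None:
    """Load the checkpoint onto the GPU (Colab gives more VRAM than RAM),
    group the state dict by layer, and save one small shard per group."""
    state_dict = torch.load(ckpt_path, map_location="cuda")

    shards = {}
    for key, tensor in state_dict.items():
        parts = key.split(".")
        # "layers.17.attention.wq.weight" -> group "layers.17"; everything else
        # (tok_embeddings, norm, output, ...) groups by its first component.
        group = ".".join(parts[:2]) if parts[0] == "layers" else parts[0]
        shards.setdefault(group, {})[key] = tensor

    for i, shard in enumerate(shards.values()):
        # Move one shard at a time to CPU, so RAM only ever holds a small piece.
        torch.save({k: v.cpu() for k, v in shard.items()}, f"shard_{i:03d}.pth")

    del state_dict, shards
    gc.collect()
    torch.cuda.empty_cache()


def stream_shards_into(model: torch.nn.Module, pattern: str = "shard_*.pth") -> None:
    """After freeing GPU memory (or in a fresh run), copy each saved shard into an
    already-constructed GPU model, deleting the shard as soon as it is loaded."""
    for path in sorted(glob.glob(pattern)):
        shard = torch.load(path, map_location="cuda")
        model.load_state_dict(shard, strict=False)  # only the keys present in this shard
        del shard
        gc.collect()
        torch.cuda.empty_cache()
```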

brendan-donohoe avatar Mar 05 '23 21:03 brendan-donohoe

Update: I was able to run it on Google Colab Pro. It seems the 12 GB of RAM on the free tier is the bottleneck.

andrewmlu avatar Mar 05 '23 21:03 andrewmlu

It's possible to make it work on the free version. Since colab gives you more GPU VRAM than RAM, what you'll want to do is load the checkpoint into CUDA rather than CPU. Once you've done that, split the state dict on the layers, save the sharded state dict, and then after freeing your GPU memory (or in another run) sequentially load each shard into the model on the GPU afterward, making sure to delete each shard once you're done. You'll save quite a bit of RAM during the loading process, and from there it should work.

I attempted loading on the GPU, and it still fails to load fully: CUDA out of memory.

andrewmlu avatar Mar 05 '23 21:03 andrewmlu

Here's a notebook that goes through the steps I just mentioned. It works for me using Colab Pro's standard GPU (~15 GB VRAM) and regular RAM runtime (~12.7 GB RAM), which I think is identical to the free version, but I'm not completely certain. If free Colab gives less VRAM than the Pro standard GPU, it may indeed be impossible, but on Pro it should at least use compute units more efficiently:

https://pastebin.com/Le2zaJCy

This uses a 15 GB T4 GPU. If you have Colab Pro, there's an option to run 13B that should work as well, though you'll have to be patient executing the second cell. Colab is slow to save files, so you may have to wait and check your Drive to make sure that everything has saved as it should before proceeding.

brendan-donohoe avatar Mar 05 '23 21:03 brendan-donohoe

I've gotten a notebook from a 4chan user to work for me on the free tier. It's VERY cumbersome to get working, but it does work. All I changed when I ran it was to skip Google Drive and instead get the model from somebody who mirrored it on Hugging Face (brave soul, but the model got flagged and will probably vanish from there). It splits the model like I mentioned, so again, if somebody could get it working on a TPU and split the models the way this notebook does, then maybe the higher-parameter models could be workable without a Colab Pro subscription.

Daviljoe193 avatar Mar 06 '23 11:03 Daviljoe193

I was able to run the model on Colab Pro. It took 27 GB of RAM for me.

For this I recommend switching to the TPU runtime (which comes with 35 GB of RAM) and adding low_cpu_mem_usage=True to the from_pretrained call, along the lines of the sketch below.
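For reference, a minimal sketch of that call via Hugging Face transformers, assuming the weights have already been converted to the HF format; the model path below is just a placeholder:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

MODEL_PATH = "path/to/converted-llama-7b"  # placeholder: a locally converted HF checkpoint

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,   # halve the footprint vs. the default fp32
    low_cpu_mem_usage=True,      # stream weights in instead of holding two full copies in RAM
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```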

usmanovaa avatar Mar 22 '23 14:03 usmanovaa

Interested to see if anyone has been able to run 7B on Google Colab. It seems like 16 GB should be enough, and the free tier often grants that much. Not sure whether Colab Pro does any better, but if anyone has managed it, advice would be much appreciated.

I was able to run it in normal Colab, but it is horribly slow because the model is loaded from Google Drive. Can anyone help me speed up the loading time and the time the model takes to type out the response?

Link to the code/commands (since it is a Linux environment): https://colab.research.google.com/drive/1otfwOihFBtNznj7ZXqiUJV_OXPm_BnN3?usp=sharing

zeeboi9 avatar May 07 '23 15:05 zeeboi9

I am writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the weights. You can even run a model over 30B that way, and you don't even need Colab. On my phone it's possible to run a 3B model; it outputs about one token (or half a token) per second, which is slow, but pretty surprising that it works on a phone at all!
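If it helps anyone, here is a minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python) with a quantized model file; the filename below is just a placeholder, and the CLI in the llama.cpp repo works the same way:

```python
from llama_cpp import Llama

# Any 4-bit quantized model file works; the filename below is a placeholder.
llm = Llama(
    model_path="llama-7b-q4_0.gguf",
    n_ctx=512,       # context window
    n_threads=4,     # CPU threads; tune for your device
)

out = llm("Q: Name three planets in the solar system. A:", max_tokens=48, stop=["Q:"])
print(out["choices"][0]["text"])
```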

johnwick123f avatar Jul 13 '23 17:07 johnwick123f

I am writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the weights. You can even run a model over 30B that way, and you don't even need Colab. On my phone it's possible to run a 3B model; it outputs about one token (or half a token) per second, which is slow, but pretty surprising that it works on a phone at all!

I'm doing some edge-computing research; mind if I ask how you run it on the phone?

liushiyi1994 avatar Jul 21 '23 15:07 liushiyi1994

I am writing this a few months later, but it's easy to run the model if you use llama.cpp and a quantized version of the weights. You can even run a model over 30B that way, and you don't even need Colab. On my phone it's possible to run a 3B model; it outputs about one token (or half a token) per second, which is slow, but pretty surprising that it works on a phone at all!

I'm doing some edge-computing research; mind if I ask how you run it on the phone?

llama.cpp supports Android. Ref: https://github.com/ggerganov/llama.cpp#android

windmaple avatar Aug 08 '23 02:08 windmaple