Has anyone done this with a single RTX 3070 Ti 8GB?

Open felipehime opened this issue 1 year ago • 7 comments

I've even tried int8, but I still get CUDA out of memory. Maybe int4? lol

felipehime avatar Mar 07 '23 03:03 felipehime
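For context, here is a minimal sketch of the int8 approach being discussed, assuming the Hugging Face transformers + bitsandbytes integration (`pip install transformers accelerate bitsandbytes`). The model path is a placeholder, not something from this thread; converting the original checkpoints to HF format is a separate step.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to a local HF-format conversion of the LLaMA 7B weights.
model_path = "/path/to/llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
    device_map="auto",   # spill layers to CPU RAM if VRAM runs out
)

inputs = tokenizer("I believe the meaning of life is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```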

Not gonna lie, 8GB of VRAM is probably not enough to get anything running at a reasonable speed. You can probably get it running, but it will be quite slow. Some people are using cloud-based solutions such as Google Colab Pro+. I personally use a Shadow PC (#105), as I can also use it for other things such as gaming.

The ideal is 16GB RAM + 16GB VRAM; then it should run with no problems.

However, if you just want to get it running and don't care much about speed, then just stick around as people are making more solutions for this every day. 😀

elephantpanda avatar Mar 07 '23 04:03 elephantpanda
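A back-of-envelope calculation shows why 8GB is tight: fp16 weights take 2 bytes per parameter before activations and the KV cache are even counted. A quick check in Python:

```python
# Weight memory for a 7B-parameter model at common precisions.
params = 7e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: {gib:.1f} GiB of weights")
# fp16: 13.0 GiB, int8: 6.5 GiB, int4: 3.3 GiB
# So fp16 cannot fit in 8 GB of VRAM, int8 is borderline once
# activations are added, and int4 leaves some headroom.
```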

Just found a solution with PyArrow. I finally got the 7B model working. I have 32GB of RAM and 8GB of VRAM. But unfortunately, the results are literally nonsense lol. Something strange is happening now: it ran once, and then I got an error.

```
Loading checkpoint
Loading tokenizer
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loading model
Loaded in 12.40 seconds
flayers: 100%|███████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 14.57it/s]
forward:   0%|                                                           | 0/504 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/felipehime/venatu/llama/example.py", line 110, in <module>
    fire.Fire(main('//home/felipehime/models-llama/7B',
  File "/home/felipehime/venatu/llama/example.py", line 95, in main
    results = generator.generate(
  File "/home/felipehime/venatu/llama/llama/generation.py", line 49, in generate
    next_token = sample_top_p(probs, top_p)
  File "/home/felipehime/venatu/llama/llama/generation.py", line 90, in sample_top_p
    next_token = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```

felipehime avatar Mar 07 '23 06:03 felipehime
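The failing call is in top-p (nucleus) sampling: `torch.multinomial` rejects a probability tensor containing `inf`/`NaN`, which usually means the logits already overflowed upstream (e.g. a bad fp16 cast or corrupted weights). Below is a sketch of what `sample_top_p` does, with an illustrative guard added; it follows the shape of the LLaMA reference code but is not a verbatim copy.

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, then sample from that set."""
    # Illustrative guard for the error in the traceback above.
    if not torch.isfinite(probs).all():
        raise ValueError("non-finite probabilities: logits overflowed upstream")
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # Zero out everything past the nucleus, renormalize, and sample.
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    # Map the sampled position back to the original vocabulary index.
    return torch.gather(probs_idx, -1, next_token)
```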

turn it off and on again? 😀

elephantpanda avatar Mar 07 '23 06:03 elephantpanda

I did it here: https://github.com/juncongmoo/pyllama

juncongmoo avatar Mar 07 '23 08:03 juncongmoo

Well, I got it running, but the results are complete nonsense, even with the example prompt "I believe the meaning of life is".

felipehime avatar Mar 07 '23 08:03 felipehime

You can experiment with 4 bits from here:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

ArakiSatoshi avatar Mar 08 '23 07:03 ArakiSatoshi
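As an aside on what 4-bit quantization buys you: GPTQ performs a calibrated, error-compensating quantization, but even naive round-to-nearest shows the memory mechanics. A simplified sketch (plain RTN, not the actual GPTQ algorithm):

```python
import torch

def quantize_rtn_4bit(w: torch.Tensor, group_size: int = 128):
    """Naive round-to-nearest 4-bit quantization per group of weights.
    GPTQ improves on this by compensating quantization error with
    second-order information, but the storage savings are similar."""
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7  # int4 range: [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_rtn_4bit(w)
err = (q.float() * scale - w.reshape(-1, 128)).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```

(Here the 4-bit values are stored one per int8 for simplicity; a real kernel packs two per byte to actually halve the footprint.)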

Yes, I did, but you need a lot of RAM. https://github.com/facebookresearch/llama/issues/79#issuecomment-1460464011

randaller avatar Mar 08 '23 17:03 randaller