Has anyone managed this with a single RTX 3070 Ti (8 GB)?
I've even tried int8, but I still get CUDA out of memory. Maybe int4? lol
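For rough intuition (my own back-of-the-envelope numbers, not from the repo), the weights alone dominate the budget, which is why int8 can still OOM on an 8 GB card while int4 plausibly fits:

```python
# Back-of-the-envelope VRAM estimate for LLaMA-7B weights only
# (ignores activations, the KV cache, and CUDA/runtime overhead).
PARAMS = 6.7e9  # roughly 6.7B parameters in the "7B" model

for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB for weights alone")

# fp16: ~12.5 GiB -> far over 8 GB
# int8: ~ 6.2 GiB -> fits on paper, but activations + KV cache push it over
# int4: ~ 3.1 GiB -> leaves headroom, hence the 4-bit suggestions below
```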
Not gonna lie, 8 GB of VRAM is probably not enough to get anything running at reasonable speed. You can probably get it running, but it will be quite slow. Some people are using cloud-based solutions such as Google Colab Pro+. I personally use a Shadow PC (#105), as I can also use it for other things such as gaming.
Ideally you want 16 GB RAM + 16 GB VRAM; then it should run with no problems.
However, if you just want to get it running and don't care much about speed, then just stick around as people are making more solutions for this every day. 😀
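If slow-but-running is acceptable, one route is 8-bit loading with automatic CPU offload via Hugging Face transformers + accelerate + bitsandbytes. A minimal sketch, assuming you already have a Hugging Face-format conversion of the 7B weights at a local path (the path and generation settings below are placeholders, not anything from the official repo):

```python
# Sketch: load LLaMA-7B in 8-bit and let accelerate spill layers to CPU RAM.
# Requires transformers, accelerate and bitsandbytes to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-7b-hf"  # placeholder: your HF-format conversion of 7B

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # bitsandbytes int8 weights
    device_map="auto",   # layers that don't fit on the GPU go to CPU RAM
)

inputs = tokenizer("I believe the meaning of life is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With heavy CPU offload this is very slow, but it avoids the hard OOM.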
Just found a solution with PyArrow. I finally got the 7B model running. I have 32 GB RAM and 8 GB VRAM, but unfortunately the results are literally nonsense lol. There's something strange happening right now: it ran once, and now I get an error.
```
Loading checkpoint
Loading tokenizer
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loading model
Loaded in 12.40 seconds
flayers: 100%|███████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 14.57it/s]
forward:   0%|                                                            | 0/504 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/felipehime/venatu/llama/example.py", line 110, in <module>
    fire.Fire(main('//home/felipehime/models-llama/7B',
  File "/home/felipehime/venatu/llama/example.py", line 95, in main
    results = generator.generate(
  File "/home/felipehime/venatu/llama/llama/generation.py", line 49, in generate
    next_token = sample_top_p(probs, top_p)
  File "/home/felipehime/venatu/llama/llama/generation.py", line 90, in sample_top_p
    next_token = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
(venatu) felipehime@felipehime:~/venatu$ ^C
(venatu) felipehime@felipehime:~/venatu$
```
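For what it's worth, the crash comes from `torch.multinomial` receiving `nan`/`inf` probabilities inside `sample_top_p`. A band-aid I've used (my own workaround, not a fix from the repo, and it won't cure nonsense output if the logits are already broken) is to sanitize the tensor before sampling:

```python
# Defensive tweak to sample_top_p (llama/generation.py) so torch.multinomial
# no longer crashes on nan/inf probabilities. This only hides the symptom:
# nan/inf logits usually mean the checkpoint, dtype, or quantization path is off.
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Replace nan/inf and negative values before doing anything else.
    probs = torch.nan_to_num(probs, nan=0.0, posinf=0.0, neginf=0.0).clamp(min=0.0)
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    # If everything got zeroed out, fall back to a uniform distribution.
    probs_sort = torch.where(
        probs_sort.sum(dim=-1, keepdim=True) > 0,
        probs_sort,
        torch.ones_like(probs_sort),
    )
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    next_token = torch.gather(probs_idx, -1, next_token)
    return next_token
```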
turn it off and on again? 😀
I did it here: https://github.com/juncongmoo/pyllama
Well, I got it running, but the results are completely nonsensical, even with the example prompt "I believe the meaning of life is".
You can experiment with 4 bits from here:
https://github.com/qwopqwop200/GPTQ-for-LLaMa
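If you just want to see where the memory saving comes from before trying that repo, here is a toy illustration of 4-bit weight-only quantization (plain round-to-nearest with per-row scales; this is not the actual GPTQ algorithm that repo implements, just the general idea):

```python
# Toy 4-bit weight-only quantization: round-to-nearest with a per-row scale.
# Illustrative only; GPTQ uses a smarter, error-compensating scheme.
import torch

def quantize_4bit(w: torch.Tensor):
    # Symmetric per-row scale so values map into the int4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # packed 2-per-byte in real kernels
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(4096, 4096)  # one LLaMA-7B-sized weight matrix
q, scale = quantize_4bit(w)
err = (dequantize(q, scale).float() - w).abs().mean()
print(f"mean abs error: {err:.4f}")
print(f"fp16 size: {w.numel() * 2 / 2**20:.1f} MiB, int4 size: ~{w.numel() * 0.5 / 2**20:.1f} MiB")
```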
Yes I did, but you need a lot of RAM. https://github.com/facebookresearch/llama/issues/79#issuecomment-1460464011