Open-Assistant

`Minimum PC Requirement` for running a local Open-Assistant instance once it starts working

hemangjoshi37a opened this issue 2 years ago

I am buying a new PC next week and I want to make sure that a local instance of Open-Assistant will run on my new system.

What configuration should I go for to get it running smoothly in my case?

Here we can define three tiers of PCs:

  1. Top Tier Pro Build : ( $2500 < Budget )
  2. Medium Tier Enthusiast Build : ( $1000 < Budget < $2500 )
  3. Beginner Friendly Minimum Requirement Build : ( Budget < $1000 )

If anyone can suggest a configuration, I would be really thankful.

hemangjoshi37a · Jan 08 '23 04:01

I think the main bottleneck to running the finished product will be having a GPU with sufficient memory to load the model itself. Until a final model architecture is selected, it won't be known exactly how much memory is required, so this question isn't really possible to answer yet.

olliestanley · Jan 08 '23 10:01

> I think the main bottleneck to running the finished product will be having a GPU with sufficient memory to load the model itself. Until a final model architecture is selected, it won't be known exactly how much memory is required, so this question isn't really possible to answer yet.

ok

hemangjoshi37a · Jan 08 '23 10:01

The goal of the first version is certainly to make a small model, but we don't know yet whether it will be small enough at the beginning to fit into a consumer GPU; we may have to rely on the community to further shrink it, so I really wouldn't make hardware decisions based on that. Just buy what you like and eventually the model will find a way to you :)

I'm closing this as answered, and we will post requirements when we actually have something to require.

yk · Jan 10 '23 21:01

@yk Can we use a sparse model from https://neuralmagic.com/ to make our model run on a consumer PC? The people at Neural Magic make some bold claims about sparse learning, like training/inferencing a GPU-class model on a CPU while losing just a few percentage points of accuracy. But experts like you @yk can verify this and give directions. Around the time you posted the video of your interview with the Neural Magic professor, I tested one of their example notebooks from the SparseML library and it ran without any errors, although I did not test it against any benchmarks, so I don't know how it compares against a GPU, but it has good potential to fit our use case.
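For illustration, here is a minimal sketch of what weight sparsification looks like using stock PyTorch pruning utilities. This is a generic stand-in for the idea, not the SparseML API, and the toy stack of linear layers is only a placeholder for a real model:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a transformer's weight matrices.
model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])

# Zero out 80% of the smallest-magnitude weights in every linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # make the sparsity permanent

# The weights are now mostly zeros; a sparsity-aware runtime (such as
# Neural Magic's DeepSparse engine) is what turns zeros into CPU speedups.
zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")
```

A dense runtime gains nothing from these zeros; the claimed CPU speedups come from inference engines that skip the zeroed weights entirely.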

hemangjoshi37a · Jan 10 '23 21:01

Or we can build an ML training internet protocol in which multiple PCs take part in a single ongoing ML model training process, where each PC handles only a selected few layers and hands off the inputs/outputs to the other chained PCs, like in this diagram: photo_2023-01-11_03-22-51

Here PCs no. 1, 3, 5 are training PCs and PCs no. 2, 4, 6 are redundancy PCs, in case their corresponding training PCs are offline for any reason.

Here PC 1 handles layers 1 to 50, then PC 3 handles layers 51 to 100, and so on.

I don't really know what the latency would be compared to training the model in a datacenter, but it is better than nothing for consumers like you and me. Even if, let's say, the training time is 10x that of the OpenAI training datacenter, it is still good enough.

It's like the Tor network, but for ML training.
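A toy sketch of the layer-splitting idea follows. It is pure illustration under the assumptions above (no networking, no redundancy), with plain linear layers standing in for the real model and each slice standing in for one participating PC:

```python
import torch
import torch.nn as nn

# 150 placeholder layers standing in for a large model.
full_model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(150)])

# PC 1 gets layers 1-50, PC 3 gets layers 51-100, PC 5 gets layers 101-150;
# PCs 2, 4, 6 would hold copies of these slices as backups.
shards = [full_model[0:50], full_model[50:100], full_model[100:150]]

def forward_through_peers(x):
    # In a real protocol each hop would be a network call to another PC;
    # here it is just a local call per shard, handing activations onward.
    for shard in shards:
        x = shard(x)
    return x

print(forward_through_peers(torch.randn(1, 256)).shape)  # torch.Size([1, 256])
```

For training rather than inference, gradients would also have to flow back along the same chain, which is where the latency cost mentioned above really bites.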

We can give each participating PC some reward, in terms of a higher-accuracy inference model or maybe a sparse model, according to the amount of compute power and time he/she has provided for training the model.

Please give this and the Neural Magic idea some thought, and assign an experienced person to it if possible.

hemangjoshi37a · Jan 10 '23 22:01

12 B model < 40 GB VRAM => Professional GPU with 40 GB VRAM (P100, A100, H100...)
6.9 B model < 24 GB VRAM => Prosumer GPU with 24 GB VRAM (RTX 3090, RTX 4090).

Those are not official numbers, just my calculation:

12 B + 36 layers @ fp16 = 24 GB weights + 7.8 GB activations (3 * 3072 * 12 * 1024 * 2 bytes * 36 layers) + overhead
6.9 B + 30 layers @ fp16 = 14 GB weights + 6.5 GB activations + overhead
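For anyone who wants to reproduce the arithmetic, here is the same back-of-the-envelope estimate in a few lines of Python (assumed shapes and decimal gigabytes, not official figures):

```python
GB = 1e9  # decimal gigabytes; fp16 = 2 bytes per parameter/activation

def weights_gb(n_params):
    return n_params * 2 / GB

def activations_gb(layers, hidden=3072, heads=12, seq_len=1024, factor=3):
    # mirrors the "3 * 3072 * 12 * 1024 * 2 bytes * N layers" estimate above
    return factor * hidden * heads * seq_len * 2 * layers / GB

print(f"12 B weights:      {weights_gb(12e9):.1f} GB")    # ~24 GB
print(f"12 B activations:  {activations_gb(36):.1f} GB")  # ~8 GB (quoted above as 7.8 GB)
print(f"6.9 B weights:     {weights_gb(6.9e9):.1f} GB")   # ~14 GB
print(f"6.9 B activations: {activations_gb(30):.1f} GB")  # ~7 GB (quoted above as 6.5 GB)
```

The "+ overhead" term covers things like the KV cache, CUDA context, and framework buffers, so some extra headroom on top of these numbers is needed in practice.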

ronaldluc · Apr 16 '23 12:04

@ronaldluc so would this work on a Mac M2 with 32 GB of RAM?

r-bielawski · Apr 18 '23 23:04

If you can run other GPT models, it should be possible, as the underlying architecture is the same as for even the smallest GPT model. I have a friend who struggled to get T5 and BERT running on the M1/M2, since the PyTorch emulation (or whatever it is) is not able to accelerate some of the functions/layers used in those models, thus running those functions/layers on the CPU => creating a bottleneck => it's very, very slow.
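As a quick sanity check on Apple silicon, something like the snippet below shows whether PyTorch's MPS backend is available. The PYTORCH_ENABLE_MPS_FALLBACK environment variable (set before launching Python) makes unsupported ops fall back to the CPU instead of erroring, which is exactly the slow path described above; the script name in the usage line is just an example:

```python
import torch
import torch.nn as nn

# Use the Metal (MPS) backend on Apple silicon if this PyTorch build supports it.
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"running on: {device}")

# Tiny smoke test; ops without an MPS kernel either raise an error or,
# with PYTORCH_ENABLE_MPS_FALLBACK=1, silently run on the CPU instead.
layer = nn.Linear(8, 8).to(device)
x = torch.randn(1, 8, device=device)
print(layer(x).device)
```

Run it as, e.g., `PYTORCH_ENABLE_MPS_FALLBACK=1 python check_mps.py` to get the CPU-fallback behaviour rather than hard errors.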

ronaldluc · Apr 19 '23 14:04

> 12 B model < 40 GB VRAM => Professional GPU with 40 GB VRAM (P100, A100, H100...)
> 6.9 B model < 24 GB VRAM => Prosumer GPU with 24 GB VRAM (RTX 3090, RTX 4090).
>
> Those are not official numbers, just my calculation:
>
> 12 B + 36 layers @ fp16 = 24 GB weights + 7.8 GB activations (3 * 3072 * 12 * 1024 * 2 bytes * 36 layers) + overhead
> 6.9 B + 30 layers @ fp16 = 14 GB weights + 6.5 GB activations + overhead

Very interesting! I currently have a 3090 and a few 3080s available.

Could I possibly run the 12 B model with a 3090 and two 3080s? It looks like it adds up in terms of available VRAM, but is the load then shared in an optimal way? I have 48 GB of RAM and can increase it to 96 GB if necessary.

I can alternatively set up a few computers instead.
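As a rough sketch of the multi-GPU option (not a tested recipe): Hugging Face Transformers with Accelerate can shard a single checkpoint across all visible GPUs via device_map="auto". The split is per layer and memory-balanced rather than compute-optimal, so the cards largely take turns instead of working in parallel. The checkpoint name and memory caps below are illustrative assumptions for a 3090 plus two 3080s:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"  # example 12 B checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # spread layers over every visible GPU
    max_memory={0: "22GiB", 1: "9GiB", 2: "9GiB"},  # 3090 + two 3080s, with headroom
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

System RAM mostly matters while the checkpoint is being loaded and for anything offloaded off-GPU, so 48 GB should already be workable for a 12 B fp16 model.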

Thanks in advance for the answers already provided and perhaps to come! Poiro_0

Poiro0 · Jun 13 '23 09:06