Open-Assistant
`Minimum PC Requirement` for running Open-Assistant local instance when it starts working
I am buying a new PC next week and I want to make sure that the local instance of Open-Assistant will run on my new system.
What configuration should I go for to get it running smoothly in my case?
Here we can define three tiers of PCs:
- Top Tier Pro Build: ($2500 < Budget)
- Medium Tier Enthusiast Build: ($1000 < Budget < $2500)
- Beginner-Friendly Minimum Requirement Build: (Budget < $1000)
If anyone can suggest a configuration, I would be really thankful.
I think the main bottleneck to running the finished product will be having a GPU with sufficient memory to load the model itself. Until a final model architecture is selected it won't be known exactly how much memory is required, so this question isn't really possible to answer yet
ok
The goal of the first version is certainly to make a small model, but we don't yet know whether it will be small enough at the beginning to fit on a consumer GPU; we may have to rely on the community to shrink it further. So I really wouldn't make hardware decisions based on that. Just buy what you like and eventually, the model will find a way to you :)
I'm closing this as answered, and we will post requirements when we actually have something to require.
@yk Can we use a sparse model from https://neuralmagic.com/ to make our model run on a consumer PC? The people at Neural Magic make some bold claims about sparse learning, like training/inferencing a GPU-class model on a CPU while losing only a few percentage points of accuracy. But more experienced people like you @yk could verify this and give directions. A while ago you posted the video of your interview with the Neural Magic professor.
I tested one of their example notebooks from the SparseML library and it ran without any errors, although I did not test it against any benchmarks, so I don't know how it compares to a GPU, but it has good potential to fit our use case.
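For anyone curious what "sparse" means here: the sketch below is plain PyTorch magnitude pruning, not the actual Neural Magic / SparseML API, just to illustrate the basic idea of zeroing out most weights. Their real recipes (gradual pruning, quantization, a sparse CPU runtime) are more involved.

```python
# Plain PyTorch illustration of unstructured magnitude pruning, the basic idea
# behind the sparse models discussed above. Not the Neural Magic/SparseML API.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 90% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.9)
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.2%}")  # ~90%
```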
Or we could build an ML-training internet protocol in which multiple PCs take part in a single ongoing ML model training process, where each PC handles only a selected few layers and hands off the inputs/outputs to the other chained PCs. Like in this diagram:
Here PC No. 1, 3, 5 are training PCs and PC No. 2, 4, 6 are redundancy PCs in case their corresponding PCs are offline for any reason.
Here PC1 handles layers 1 to 50, then PC2 handles layers 51 to 100, and so on.
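Below is a minimal single-machine sketch (PyTorch, my own choice of framework, nothing from the repo) of that layer split: stage1 stands in for PC1's layers 1-50 and stage2 for PC2's layers 51-100. In the real protocol, the activations would be serialized and sent over the network between machines instead of staying in one process.

```python
# Toy sketch of the layer split described above, run on one machine for clarity.
# In the proposed protocol, stage1 would live on PC1 and stage2 on PC2, and the
# activation tensor would be shipped over the network between them.
import torch
import torch.nn as nn

hidden = 256
layers = [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(100)]

stage1 = nn.Sequential(*layers[:50])   # "PC1": layers 1-50
stage2 = nn.Sequential(*layers[50:])   # "PC2": layers 51-100

x = torch.randn(8, hidden)
activations = stage1(x)           # PC1 forward pass
# ... here the activations would be sent to the next training PC ...
output = stage2(activations)      # PC2 forward pass
print(output.shape)               # torch.Size([8, 256])
```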
I don't really know what the latency would be compared to training in a datacenter, but it is better than nothing for consumers like you and me. Even if the training time were, say, 10x that of the OpenAI training datacenter, it would still be good enough.
It's like the Tor network, but for ML training.
We could reward each participating PC with a higher-accuracy inference model, or maybe a sparse model, according to the amount of compute power and time they have provided for training the model.
Please give this and the Neural Magic idea some thought, and assign an experienced person to it if possible.
12 B model < 40 GB VRAM => professional GPU with 40 GB VRAM (P100, A100, H100, ...)
6.9 B model < 24 GB VRAM => prosumer GPU with 24 GB VRAM (RTX 3090, RTX 4090)

Those are not official numbers, just my calculation:
- 12 B, 36 layers @ fp16 = 24 GB weights + 7.8 GB activations (3 * 3072 * 12 * 1024 * 2 bytes * 36 layers) + overhead
- 6.9 B, 30 layers @ fp16 = 14 GB weights + 6.5 GB activations + overhead
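To make that calculation easy to replay, here is the same back-of-envelope math as a tiny script. The activation factors (3 * hidden * heads * seq_len * 2 bytes * layers) are taken straight from the comment above, not from any official model spec.

```python
# Back-of-envelope VRAM estimate reproducing the numbers above.
# The activation formula is copied from the comment, not an official figure.
def vram_gb(params_billion, hidden, heads, seq_len, layers, bytes_per_elem=2):
    weights = params_billion * 1e9 * bytes_per_elem                      # fp16 weights
    activations = 3 * hidden * heads * seq_len * bytes_per_elem * layers
    return weights / 1e9, activations / 1e9                              # decimal GB

w, a = vram_gb(12.0, hidden=3072, heads=12, seq_len=1024, layers=36)
print(f"12 B model: {w:.0f} GB weights + {a:.1f} GB activations + overhead")
# -> about 24 GB weights and ~8 GB activations (close to the 7.8 GB quoted above),
#    which is why a 40 GB card is suggested for the 12 B model
```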
@ronaldluc so would this work on a Mac M2 with 32 GB RAM?
If you can run other GPT models, it should be possible, since the underlying architecture is the same as even the smallest GPT model. I have a friend who struggled to get T5 and BERT running on the M1/M2, since the PyTorch emulation (or whatever it is) cannot accelerate some of the functions/layers used in those models, so those functions/layers run on the CPU => creating a bottleneck => it's very, very slow.
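If you want to see what your M1/M2 can actually use, the snippet below (standard PyTorch, nothing Open-Assistant-specific) shows how to detect the Apple-silicon MPS backend and opt into the CPU fallback for ops it doesn't implement, which is exactly the slow path described above; op coverage varies by PyTorch version.

```python
# Quick check for PyTorch's Apple-silicon (MPS) backend. Ops that MPS doesn't
# implement can be allowed to fall back to the CPU (slow, as described above)
# by setting PYTORCH_ENABLE_MPS_FALLBACK=1 before torch is loaded.
import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"using device: {device}")

x = torch.randn(4, 4, device=device)
print((x @ x).device)  # reports mps on an M1/M2 Mac, cpu otherwise
```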
Very interesting! I currently have a 3090 and a few 3080s available.
Could I possibly run the 12 B model with a 3090 and two 3080s? It looks like it adds up in terms of available VRAM, but would the load then be shared in an optimal way? I have 48 GB of RAM and can increase to 96 GB if necessary.
I can alternatively set up a few computers instead
Thanks in advance for the answers already provided and perhaps to come! Poiro_0
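Not an official answer, but a common way to try exactly this kind of multi-GPU split is Hugging Face transformers with accelerate's device_map="auto", which places layers on whichever GPUs have free VRAM. The checkpoint name below is only a placeholder, and note that the cards then work on different layers one after another, so the VRAM adds up but the throughput is not three cards' worth.

```python
# Sketch of sharding one large model across several local GPUs with Hugging Face
# transformers + accelerate. The model name is a placeholder; substitute whatever
# checkpoint you actually run. Layers are placed per-GPU by available VRAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-12b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # spreads layers over the 3090 + 3080s (requires accelerate)
    torch_dtype="auto",   # keep fp16 weights where the checkpoint provides them
)

inputs = tokenizer("Hello, Open-Assistant!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```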