text-generation-inference
Question: How to estimate memory requirements for a certain batch size?
I was wondering how the GPU memory requirements vary with model size, request batch size, and max tokens. In some experiments where I needed the server to keep running for a long time, it often ran out of memory and shut down. Is there a way to estimate the memory footprint from these variables?
Unfortunately not at the moment. https://github.com/huggingface/text-generation-inference/issues/478 might help with memory usage.
Other than that, --max-batch-total-tokens is really the variable you need to set to control how much memory you're going to need.
Run text-generation-launcher --help for further information; the other control variables listed there can help too.
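For a rough back-of-the-envelope estimate, the footprint is roughly the model weights plus the KV cache, and the KV cache grows linearly with the total number of batch tokens. Here is a minimal sketch, assuming fp16 weights and a standard multi-head-attention KV cache (grouped-query attention models use less, and activation/framework overhead is ignored), so treat the numbers as ballpark only:

def estimate_gpu_memory_gb(
    n_params_billion: float,      # model size, e.g. 7 for a 7B model
    n_layers: int,                # number of transformer layers
    hidden_size: int,             # model hidden dimension
    max_batch_total_tokens: int,  # value you plan to pass to --max-batch-total-tokens
    bytes_per_weight: float = 2,  # fp16/bf16 weights
    bytes_per_kv: float = 2,      # fp16/bf16 KV cache
) -> float:
    """Very rough estimate: weights + KV cache only.

    Ignores activations, CUDA context and framework overhead, so budget
    roughly 10-20% extra on top of the returned number.
    """
    weights = n_params_billion * 1e9 * bytes_per_weight
    # One K and one V tensor per layer, hidden_size values per token each
    # (grouped-query attention models store proportionally less)
    kv_cache = 2 * n_layers * hidden_size * bytes_per_kv * max_batch_total_tokens
    return (weights + kv_cache) / 1e9

# Example: a 7B model (32 layers, hidden size 4096) with 16k total batch tokens
print(f"{estimate_gpu_memory_gb(7, 32, 4096, 16384):.1f} GB")  # ~22.6 GB before overhead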
You can use the benchmarking tool to make sure you don't OOM at a given setting, and then use those settings as the maximum values in the launcher.
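If you'd rather sanity-check a running server directly, a quick burst of concurrent requests at your expected concurrency and output length can also confirm the settings hold up. A sketch using the standard /generate REST endpoint (the URL, prompt, and sizes below are placeholders to adapt to your deployment) while you watch nvidia-smi and the server logs:

import concurrent.futures
import requests

TGI_URL = "http://localhost:8080/generate"  # adjust to your deployment

def one_request(prompt: str) -> int:
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512},  # match your real workload
    }
    resp = requests.post(TGI_URL, json=payload, timeout=600)
    return resp.status_code

# Fire a burst at the concurrency you expect in production and watch
# for OOMs while it runs.
prompts = ["Tell me a long story about GPUs. " * 50] * 32
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    codes = list(pool.map(one_request, prompts))
print(codes)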
Thank you all, I'll give these approaches a shot.