llama2-webui
Support listening for network requests on a specific port and running GPTQ models across multiple GPUs
When a GPTQ model is run on multiple GPUs, memory is allocated only on the first GPU, which causes an error once no more memory can be allocated there. This PR fixes that by spreading the model across all available GPUs. It also allows listening for network requests on a specific host and port, which is a necessary feature since the deployment environment is unlikely to have a graphical interface.
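As a rough sketch of the multi-GPU side (the helper name and parameters here are hypothetical, not necessarily what the PR's diff does): one common way to stop everything from landing on GPU 0 is to hand Accelerate an explicit per-device memory budget via `max_memory` alongside `device_map="auto"`, so model layers get dispatched across every listed GPU:

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int) -> dict:
    """Build an Accelerate-style max_memory map, e.g. {0: "20GiB", 1: "20GiB"}.

    Hypothetical helper: with this map plus device_map="auto", model weights
    are distributed across all listed GPUs instead of only GPU 0.
    """
    return {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}


# Sketch of how the map and the network-listening change might be wired up
# (not run here; argument names for the model loader and Gradio's launch()
# are real library parameters, but the surrounding code is illustrative):
#
#   model = AutoGPTQForCausalLM.from_quantized(
#       model_path,
#       device_map="auto",
#       max_memory=build_max_memory(num_gpus=2, per_gpu_gib=20),
#   )
#   demo.launch(server_name="0.0.0.0", server_port=7860)

print(build_max_memory(2, 20))
```

Binding Gradio to `0.0.0.0` with an explicit `server_port` is what makes the UI reachable over the network from a headless deployment host.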