Alexandre Strube
As @tarangill said, you have `~/.cache/huggingface/hub` where the models end up. I will close this issue as it's pretty old and I think you found your models by now :-)
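If it helps anyone else, a quick way to see what is already on disk (assuming the default cache location; the example repo name is just illustrative):

```bash
# List the locally cached Hugging Face models; each subdirectory
# corresponds to one downloaded repo (e.g. models--lmsys--vicuna-7b-v1.5).
ls ~/.cache/huggingface/hub
```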
> --share

It doesn't work for the gradio_web_server.py:

```
2023-06-29 12:03:00 | INFO | gradio_web_server | args: Namespace(host='0.0.0.0', port=None, share=True, controller_url='http://localhost:21001', concurrency_count=10, model_list_mode='reload', moderate=False, add_chatgpt=False, add_claude=False, add_palm=False, gradio_auth_path=None)
2023-06-29 12:03:00...
```
It's a problem with Gradio. We have to report it on their repository.
Exactly. That's a problem with Gradio, and it should be reported there.
As the OP moved on, I will close this one. If anyone feels like this is not a good solution, please reopen.
@Halflifefa this has to do with the model you are using. The model "spills" over from one GPU to the next when the first one's memory is full. If you use a LLaMa2-70,...
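As a rough sketch of how I'd spread a 70B model over several GPUs (the model path and memory cap are placeholders, and `--max-gpu-memory` assumes a FastChat version that supports it):

```bash
# Sketch only, not a drop-in command.
# With --num-gpus > 1, weights that do not fit on the first GPU
# "spill" onto the next one; --max-gpu-memory caps each card.
python3 -m fastchat.serve.model_worker \
    --model-path meta-llama/Llama-2-70b-chat-hf \
    --num-gpus 4 \
    --max-gpu-memory 20GiB
```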
How do you run the controller?
So, this works with `fastchat.serve.gradio_web_server_multi` (provided you restart the server), but it does not with `fastchat.serve.gradio_web_server` - which makes the model selection tab on the web_server moot.
Ok, this now works: `--model-list-mode=reload`
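For anyone landing here later, a minimal sketch of how I start the web server with that flag (controller URL and port are just the defaults from the logs above; adjust to your setup):

```bash
# Reload the model list from the controller on each page load,
# so newly registered workers show up without restarting the server.
python3 -m fastchat.serve.gradio_web_server \
    --controller-url http://localhost:21001 \
    --model-list-mode reload
```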
Same for me:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python3 $FASTCHAT/fastchat/serve/model_worker.py \
    --controller $FASTCHAT_CONTROLLER:$FASTCHAT_CONTROLLER_PORT \
    --port 31029 \
    --worker http://$(hostname):31029 \
    --num-gpus 8 \
    --model-path models/Mixtral-8x22B-v0.1
```

vLLM also works multi-GPU just fine....
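For completeness, the rough vLLM equivalent for the same model (flag names may differ between FastChat versions, so check `python3 -m fastchat.serve.vllm_worker --help` first):

```bash
# Rough sketch of the vLLM-backed worker for the same model.
# --num-gpus is assumed to map to vLLM's tensor parallelism here;
# verify against your FastChat version.
python3 -m fastchat.serve.vllm_worker \
    --model-path models/Mixtral-8x22B-v0.1 \
    --num-gpus 8
```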