Pierce Freeman
                                    @youkaichao Attached the truncated tail. It stays this way indefinitely with no additional calls written. Seems like it's either legitimately stalling out there or trying to retry the download... ```...
@youkaichao Huggingface by itself seems like it's working fine. It loads the full model in about 3.5 mins.

```python
from transformers import AutoModel, AutoTokenizer

print("Loading initial model")
model = AutoModel.from_pretrained(MODEL_DIR)
tokenizer...
```
@youkaichao That seems fine too:

```python
print("Trying to load hub...")
from huggingface_hub import HfFileSystem, snapshot_download
print("Did load hub...")
```

```
Trying to load hub...
Did load hub...
```
@youkaichao Sure thing, here's the environment with the additional logging. Based on this, it looks like the stall is happening somewhere other than the HF hub: ```...
@youkaichao Here's the full stack trace:

```
PART 0 up_helper in /usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:1429
2024-04-23 05:48:36.124556 Return from pg_to_tag in /usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:534 to _new_process_group_helper in /usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:1429
2024-04-23 05:48:36.124669 Return from _new_process_group_helper in /usr/local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:1430...
```
I suspected there was something funky going on with NCCL. The image / hardware configuration is relatively conventional, though, so I wonder if there's something amiss at the host OS...
@majestichou At least in my case, the issue was prompted by cross-GPU coordination (NCCL in particular) on an inference box. Doesn't thus far seem to be architecture related so might...
If you're already using Chromium, this is pretty easy to do over CDP. You'll just need to use the [new](https://developer.chrome.com/articles/new-headless/) headless mode or a headful spawn, since the old headless...
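If it helps, here's a minimal sketch of that flow (the Chromium binary path and debugging port below are assumptions on my end, adjust for your setup): launch with the new headless mode plus a remote debugging port, then pull the CDP websocket endpoint from the `/json/version` discovery route.

```python
import json
import subprocess
import time
from urllib.request import urlopen

CHROME_BIN = "/usr/bin/chromium"  # assumed binary path -- point this at your install
DEBUG_PORT = 9222                 # assumed debugging port

# Launch with the *new* headless mode so CDP behaves like a headful browser.
proc = subprocess.Popen([
    CHROME_BIN,
    "--headless=new",
    f"--remote-debugging-port={DEBUG_PORT}",
    "--no-first-run",
])

# Give the browser a moment to bind the port, then fetch the CDP endpoint.
time.sleep(2)
with urlopen(f"http://127.0.0.1:{DEBUG_PORT}/json/version") as resp:
    info = json.load(resp)

print(info["webSocketDebuggerUrl"])  # hand this to whatever CDP client you're using
```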
@Yard1 I've tried llama3 on v0.4.0.post1, but this issue is still present when initializing the engine with lora adapters. The latest `main` code seems to get further in the initialization...
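For reference, a minimal sketch of a LoRA-enabled engine init along these lines (the model name and adapter path here are placeholders, not my exact repro):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder model -- substitute the checkpoint you're actually serving.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
)

outputs = llm.generate(
    ["Hello, world"],
    SamplingParams(max_tokens=32),
    # Placeholder adapter name/path.
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
```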
@Techinix You can either manually install the wheel from the [Release page](https://github.com/vllm-project/vllm/releases/tag/v0.4.1) or build it yourself locally from the [tagged git version](https://github.com/vllm-project/vllm/tree/v0.4.1). On my setup, a local build takes around 15-30 min.
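Roughly, either path looks like this (the wheel filename is a placeholder, grab the actual asset name from the release page):

```
# Option 1: prebuilt wheel downloaded from the v0.4.1 release page
pip install ./vllm-0.4.1-<platform-tag>.whl

# Option 2: build from the tagged source
git clone --branch v0.4.1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```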