paolovic

Results: 51 comments of paolovic

Somehow I don't see the recommendation to apply `model.eval` here @Parskatt, but thank you, I changed my implementation to:

```python
batch = {"im_A": query_images, "im_B": ref_batch_images}
roma_model.eval()
with torch.inference_mode():
    ...
```
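For context, the full pattern presumably looks something like this (a minimal sketch; `roma_model`, `query_images`, and `ref_batch_images` come from the surrounding script, and the exact forward call is an assumption on my side):

```python
import torch

# Minimal sketch of the inference pattern above. The forward call is an
# assumption -- adapt it to RoMa's actual API.
batch = {"im_A": query_images, "im_B": ref_batch_images}

roma_model.eval()              # disable dropout, use running batch-norm stats
with torch.inference_mode():   # skip autograd bookkeeping during inference
    output = roma_model(batch)
```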

> Yeah, GitHub bugged for me and showed my comment as duplicated; I removed one and both disappeared...

Alright, in any case thank you very much!

Same for me: it basically takes the whole GPU, almost 11 GB in my case. This is how I could reduce it:

```python
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
...
```
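For anyone landing here, the usual ways to stop TensorFlow from grabbing the whole card are on-demand memory growth or a hard memory cap; a minimal sketch (the 4096 MiB limit is just an example value):

```python
import tensorflow as tf

# Must run before any GPU is initialized (i.e. right after the imports).
physical_devices = tf.config.list_physical_devices('GPU')
for gpu in physical_devices:
    # Option 1: allocate GPU memory on demand instead of all upfront.
    tf.config.experimental.set_memory_growth(gpu, True)

# Option 2 (alternative; don't combine with memory growth on the same GPU):
# hard-cap the allocation, e.g. at 4 GiB.
# tf.config.set_logical_device_configuration(
#     physical_devices[0],
#     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])
```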

> @paolovic Hi, can you tell me in more detail how you solved it?

Hi, I inserted the snippet from my previous post after the imports in the omniglue_demo script, ...

Well... I know that's what I wrote; maybe I wasn't able to express my case clearly enough. But I cannot "open" my internet connection, I work in a restricted environment...

@youkaichao @ringos Thank you very much! I'll try it out and will come back to you! Best regards

@youkaichao @ringos Thank you very much for your support! In the end, ringos' approach did the trick for me: `GIT_REPOSITORY https://github.com/nvidia/cutlass.git` => `GIT_REPOSITORY `.

I have 2x L40s and cannot reproduce with `Meta-Llama-3.1-8B-Instruct-quantized.w8a16` (https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16), invoked like this:

```bash
VLLM_USE_V1=1 vllm serve Meta-Llama-3.1-8B-Instruct-quantized.w8a16 \
  --host 0.0.0.0 \
  --served-model-name llama3.1-8B llama3.1-8B-Int8 \
  --port 8000 \
  --max-model-len 65536 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json
```

...
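For completeness, a server started this way can be exercised through vLLM's OpenAI-compatible endpoint; a minimal sketch (the prompt and host are placeholders):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API at /v1 on the port given above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="llama3.1-8B",  # one of the --served-model-name aliases above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```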

Hi @manitadayon, I am downloading https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 now. How did you quantize it? Using Hugging Face + AutoGPTQ? How many bits? Thank you and best regards

@manitadayon Mistral in half precision is larger than `nvidia/Llama-3_3-Nemotron-Super-49B-v1` in 4-bit. Anyway, since I suspected a memory leak, I was hoping it would lead to an OOM error...