Suraj Subramanian
This is not an API that Meta Llama offers. Please reach out to https://www.llama-api.com/ regarding this issue.
Yes, running the 70B requires 8 GPUs since the checkpoint has 8 shards. You can run it on a different number of GPUs via Hugging Face.
https://github.com/meta-llama/llama3?tab=readme-ov-file#access-to-hugging-face
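As a rough sketch of the Hugging Face route (the model id and prompt below are just placeholders; you need `accelerate` installed and license access granted on the Hub):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumes you have accepted the license on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # accelerate shards the layers across however many GPUs are visible
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"` the layer placement adapts to the GPUs you actually have, so you are not tied to the 8-shard layout of the original checkpoint.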
Hi, you could try using `torch.compile(mode='reduce-overhead')` to speed up inference with CUDA graphs. We have some examples using vLLM here: https://github.com/meta-llama/llama-recipes
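Roughly something like this (the model id and generation args are placeholders; expect the first few calls to be slow while compilation and CUDA graph capture warm up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Compile just the forward pass; reduce-overhead uses CUDA graphs to cut
# per-token kernel launch overhead during decoding.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```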
Looks like you're using the quantized models, which might be hampering the model's performance on numerical data. I cannot replicate this issue on the official Meta Llama models; I get...
I agree with you. DDP does not explicitly do anything to enforce synchronization of the optimizers; the states only remain identical because the same states are sent to each process....
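To make that concrete, here is a minimal sketch (launched with torchrun; names and sizes are arbitrary): each rank builds its own optimizer and nothing syncs its state directly, but because DDP all-reduces the gradients and every rank starts from the same parameters, every rank applies the same update and the optimizer states stay in lockstep.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")              # torchrun provides the env:// rendezvous vars
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 16).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # broadcasts params from rank 0 at construction

    # Each process creates its own optimizer; DDP never touches this object.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(8, 16, device=local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()      # gradients are averaged across ranks during backward
    optimizer.step()     # same params + same grads -> same exp_avg / exp_avg_sq on every rank
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```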
Hi, I'm not sure what your question is. Can you share minimal code snippets so we can better understand your query?
Both are fine. In the first one you're letting the LLM determine what the first output token should be, whereas in the second one you are enforcing the first output...
Thanks - although this isn't a critical change, it does help readability. The correct token is `end_header_id`; if you can update the PR I'll merge it.
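For context, this is where the token appears in the Llama 3 chat format (the message content below is just an illustration):

```python
# Header tokens wrap the role name; each turn is closed with <|eot_id|>.
prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What is the capital of France?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
```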
The error is probably related to the `init_method` arg you have passed... why are you passing that in? Also ensure your machine has 8 GPUs, as that is a requirement for...
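As a rough sketch of the expected setup (checkpoint paths are placeholders): when the example scripts are launched with torchrun, the rendezvous is driven by environment variables, so there is no need to pass a custom `init_method`.

```python
# Typical launch for the 70B model, one process per GPU:
#
#   torchrun --nproc_per_node 8 example_chat_completion.py \
#       --ckpt_dir Meta-Llama-3-70B-Instruct/ \
#       --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model
#
import torch.distributed as dist

# No init_method needed: the default env:// rendezvous reads
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE that torchrun sets.
dist.init_process_group("nccl")
```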