mlx-llm
stupid question, any way to avoid running out of memory?
If I put an input of 17,000 tokens into model.generate(x, temperature)
I get
libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 19081554496 bytes which is greater than the maximum allowed buffer size of 17179869184 bytes.
I guess it's trying to allocate this on the Mac GPU? Or, if it's in regular memory, it just can't swap? I can run this Llama 3 8B Instruct with regular Transformers; it's just really slow.
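My rough guess at where the ~19 GB comes from (purely an assumption on my part: that the whole prompt is prefilled in one pass and the fp16 attention-score matrix over the full sequence gets materialized, with the 32 attention heads Llama 3 8B has):

    # back-of-envelope: attention scores for a single full-prompt prefill pass
    seq_len = 17_000          # prompt length from above
    n_heads = 32              # Llama 3 8B attention heads
    bytes_per_elem = 2        # fp16
    scores_bytes = n_heads * seq_len * seq_len * bytes_per_elem
    print(scores_bytes / 1e9)  # ~18.5 GB, same ballpark as the 19,081,554,496 in the error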
There's no flag for use_swap=True
or anything like that, right?
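For reference, this is roughly how I'd check what limits MLX sees on my machine (assuming a reasonably recent mlx build; mx.metal.device_info() and the memory getters are the only calls I'm relying on here):

    import mlx.core as mx

    # what Metal reports for this device: total memory, max single-buffer size, etc.
    info = mx.metal.device_info()
    print(info)  # max_buffer_length should line up with the 17179869184 in the error

    # how much MLX has actually allocated so far
    print(mx.metal.get_active_memory())
    print(mx.metal.get_peak_memory())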