mlx-llm
stupid question, any way to avoid running out of memory?
If I put an input of 17,000 tokens into model.generate(x, temperature)
I get
libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 19081554496 bytes which is greater than the maximum allowed buffer size of 17179869184 bytes.
I guess it's trying to allocate this on the Mac GPU? Or, if it's in regular memory, it just can't swap? I can run this Llama 3 8B Instruct with regular Transformers; it's just really slow.
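My rough guess at where the ~19 GB comes from (purely an assumption on my part: that the whole prompt is prefilled in one pass and the fp16 attention-score matrix over the full sequence gets materialized, with the 32 attention heads Llama 3 8B has):

    # back-of-envelope: attention scores for a single full-prompt prefill pass
    seq_len = 17_000          # prompt length from above
    n_heads = 32              # Llama 3 8B attention heads
    bytes_per_elem = 2        # fp16
    scores_bytes = n_heads * seq_len * seq_len * bytes_per_elem
    print(scores_bytes / 1e9)  # ~18.5 GB, same ballpark as the 19,081,554,496 in the error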
There's no flag for use_swap=True
or anything like that, right?
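For reference, this is roughly how I'd check what limits MLX sees on my machine (assuming a reasonably recent mlx build; mx.metal.device_info() and the memory getters are the only calls I'm relying on here):

    import mlx.core as mx

    # what Metal reports for this device: total memory, max single-buffer size, etc.
    info = mx.metal.device_info()
    print(info)  # max_buffer_length should line up with the 17179869184 in the error

    # how much MLX has actually allocated so far
    print(mx.metal.get_active_memory())
    print(mx.metal.get_peak_memory())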