Andy Ehrenberg

12 comments by Andy Ehrenberg

Some of the extra GPU memory can probably be attributed to how Flax generation implements the KV cache. Check what happens when you set `max_new_tokens` to be...
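As a rough sketch of why the generation length matters here: assuming the KV cache is preallocated for the full sequence length (prompt plus new tokens) instead of growing token by token, its size scales linearly with that length. The helper `kv_cache_bytes` and the model dimensions below are hypothetical and not taken from the thread.

```python
def kv_cache_bytes(num_layers, batch_size, seq_len, num_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    Assumes one key tensor and one value tensor per layer, each of shape
    (batch_size, seq_len, num_heads, head_dim), stored in a 2-byte dtype
    such as float16 or bfloat16 by default.
    """
    # Factor 2: one tensor for keys, one for values.
    return 2 * num_layers * batch_size * seq_len * num_heads * head_dim * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=32, batch_size=1, num_heads=32, head_dim=128)
prompt_len = 16

for max_new_tokens in (16, 256, 2048):
    # A cache preallocated up front must cover prompt + generated tokens.
    total_len = prompt_len + max_new_tokens
    size = kv_cache_bytes(seq_len=total_len, **cfg)
    print(f"max_new_tokens={max_new_tokens:5d}: {size / 2**20:.1f} MiB")
```

If the measured memory moves with the generation length in roughly this way, the cache is a likely contributor; if it does not, the extra memory probably comes from somewhere else.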

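Also, it doesn't make sense to run the Flax code within a `torch.no_grad()` context: that context only disables gradient tracking in PyTorch's autograd and has no effect on JAX/Flax computations.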