49 comments of Srinivas Billa

Yeah, I think first we need to solve batch inference. It's implemented in babyllama, but I haven't tried to port it over to the main llama yet.

That's fair. Batch inference would be useful for me to use this at scale, for example if I want to do sentiment analysis or summarisation over a large dataset.

And, in this case, having a server that can handle multiple users at the same time.

@danielhanchen I'm happy to port the implementation over if you want to include it in unsloth. It would look like a separate training script with the necessary files being included...

Actually on second thought I'll work on it anyway since I also need this pretty bad lol. Since I do a lot of training runs every day it would save...

MeetKai said in the PR that it's okay with positional encoding? https://github.com/huggingface/trl/pull/1235#issuecomment-1900632280 He also said it could be implemented without FA, but I'm not sure how to do that. And yeah...

Contamination is definitely an issue. I've tested it on the same dataset, which is heavily correlated (aspect-based sentiment), and the difference between the packed and non-packed runs is big.

Following on from @vrdn-23, #3466 would be great too. I already use Ray for scaling across multiple nodes, and this is the only solution that works when using models...

Thanks @mgoin, yes, the performance isn't as good as INT4. However, the model quality is nearly indistinguishable from fp16, which is really nice. I hope that fp6 becomes the...

@mgoin I'm a bit confused: why does fp6 not save VRAM? Even if the activations are in fp16, surely storing the weights in fp6 saves memory, right?
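For intuition on that question, here's a back-of-envelope sketch (my own illustrative numbers, not from the thread): weight storage scales with bits per parameter regardless of the activation dtype, so a 6-bit format should cut the weight footprint to roughly 6/16 of fp16. Whether this shows up as VRAM savings in practice depends on how the kernels store and unpack the weights.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB.

    Ignores activations, KV cache, and any per-group quantization
    metadata (scales/zeros), which add a small overhead in practice.
    """
    return num_params * bits_per_param / 8 / 1024**3

# Hypothetical 7B-parameter model as an example.
params = 7e9
fp16_gib = weight_memory_gib(params, 16)  # ~13.0 GiB
fp6_gib = weight_memory_gib(params, 6)    # ~4.9 GiB
print(f"fp16 weights: {fp16_gib:.1f} GiB, fp6 weights: {fp6_gib:.1f} GiB")
```

So in raw storage terms the fp6 weights are about 2.7x smaller; any gap from that ideal would come from padding, metadata, or the runtime keeping an fp16 copy around.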