Daniel Han
@thedarkzeno Oh I just added a fix for embed_tokens and lm_head :) You might have to update Unsloth :)
@thedarkzeno On that note - do you know if the losses align now? :)
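For anyone wanting to try the embed_tokens / lm_head fix, here's a minimal sketch of how the extra targets are typically passed in Unsloth - the checkpoint name and hyperparameters are placeholders, and exact behaviour depends on your Unsloth version:
```python
from unsloth import FastLanguageModel

# Placeholder checkpoint and settings - adjust for your own setup.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Adding embed_tokens and lm_head to the trainable targets, which is what
# the fix above concerns; argument handling may differ by version.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
    use_gradient_checkpointing = True,
)
```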
@thedarkzeno I'm assuming it's the layernorms - we don't actually support FFT (full fine-tuning) since the layernorm gradients are more involved to calculate, hence the difference
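To give a sense of why the layernorm backward is more involved than a plain elementwise op, here's a reference sketch in plain PyTorch (not Unsloth's actual kernel) of the gradient for y = (x - mu) / sigma * gamma + beta over the last dimension:
```python
import torch

def layernorm_backward(dy, x, gamma, eps = 1e-5):
    """Reference backward for y = (x - mu) / sigma * gamma + beta,
    normalized over the last dim. Note the two mean terms: every element
    of a row contributes to every gradient entry in that row, which is
    what makes this heavier than an elementwise op."""
    mu      = x.mean(-1, keepdim = True)
    var     = x.var(-1, unbiased = False, keepdim = True)
    inv_std = torch.rsqrt(var + eps)
    x_hat   = (x - mu) * inv_std

    dx_hat = dy * gamma
    dx = inv_std * (
        dx_hat
        - dx_hat.mean(-1, keepdim = True)
        - x_hat * (dx_hat * x_hat).mean(-1, keepdim = True)
    )
    # Parameter gradients: sum over every position except the hidden dim.
    dgamma = (dy * x_hat).reshape(-1, x.shape[-1]).sum(0)
    dbeta  = dy.reshape(-1, x.shape[-1]).sum(0)
    return dx, dgamma, dbeta
```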
@quancore I'm not sure if vLLM allows serving in 4 or 8 bits! 16bit yes, but unsure on 4 or 8
@patleeman Oh ye AWQ is great - I'm assuming you want to quantize it to AWQ?
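If AWQ is the goal, here's a rough sketch of how that conversion usually goes with the AutoAWQ library - the paths and quant settings are placeholders, and you'd merge the fine-tune to 16bit first:
```python
# Sketch of converting a merged 16bit fine-tune to AWQ with AutoAWQ.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/merged-16bit-model"   # placeholder
quant_path = "path/to/model-awq"            # placeholder

quant_config = {"zero_point": True, "q_group_size": 128,
                "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Runs AWQ calibration and writes the 4bit weights to disk.
model.quantize(tokenizer, quant_config = quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```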
@ziemowit-s I'll check this out! Sorry on the issue!
@ziemowit-s @its5Q Apologies on the issues again :( Still debugging stuff so sorry on that!
Actually I can confirm - batched inference is in fact breaking - I'm working on a fix asap - sorry for the wait guys!
@ziemowit-s @its5Q Many apologies on the delay - I temporarily fixed it by disabling Unsloth's fast inference paths - it seems like I need to dig deeper on why this...
@ziemowit-s @its5Q I think I finally fixed it!! On the example @ziemowit-s provided me: ``` [' The text emphasizes the benefits of humor in the healing process, including reducing stress,...