Nicolas Patry
Hi @leiwen83. Indeed, beam search is not implemented; however, we have a different algorithm that seems to work just as well or even better: `best_of`, taking the best of `n`...
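(Editor's illustration, not part of the original comment.) The idea behind a best-of-`n` strategy can be sketched roughly as follows: draw `n` independent sampled completions and keep the one with the highest cumulative token log-probability. This is a minimal sketch under stated assumptions, not the text-generation-inference implementation; the function `best_of`, the `toy_sampler` stand-in for a model, and the fixed three-token vocabulary are all hypothetical names introduced for the example.

```python
import math
import random
from typing import Callable, List, Tuple

# A candidate is (tokens, cumulative log-probability of those tokens).
Candidate = Tuple[List[str], float]


def best_of(sample: Callable[[random.Random], Candidate], n: int, seed: int = 0) -> Candidate:
    """Draw n independent samples and return the one with the highest log-probability."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=lambda c: c[1])


def toy_sampler(rng: random.Random) -> Candidate:
    """Hypothetical stand-in for a language model: samples 4 tokens from a fixed distribution."""
    vocab = {"a": 0.5, "b": 0.3, "c": 0.2}
    tokens: List[str] = []
    logprob = 0.0
    for _ in range(4):
        tok = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        tokens.append(tok)
        logprob += math.log(vocab[tok])
    return tokens, logprob


if __name__ == "__main__":
    tokens, score = best_of(toy_sampler, n=8)
    print(tokens, score)
```

Unlike beam search, the `n` samples here never interact during decoding, so they can be generated as an ordinary batch and compared only at the end.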
Beam search is much worse than best_of performance-wise. The timings you show here are surprisingly different. How did you measure (model, hardware, where did you get the timing...
Oh I see, bnb-nf4 is just super slow on anything above batch_size=1. It has nothing to do with best_of.
@bloodsucker99 do you mind opening a PR for it? I'm not sure where the clear should be added.
@Rogerwyf I made the PR for it: https://github.com/huggingface/text-generation-inference/pull/829 Thank you @bloodsucker99. However, if that fixes it, it looks like it might not be an actual leak, just torch allocator...
Have you taken the latest image for a spin?
@ZeroYuJie What hardware + CUDA version + environment?
> --quantize bitsandbytes-nf4 This seems to come up every time I see this issue; it seems to be bnb leaking. We happen to not use it in production ourselves which...
Thanks a lot for this PR and for fixing the unsoundness (unsafe). This PR seems even slightly better, using Atomics instead (which are lock-free): https://github.com/huggingface/tokenizers/pull/1532
It's a very interesting idea that has been discussed internally before. Thanks for reopening the discussion. The (legal) work needed would be nontrivial, so if huggingface could get a...