Jorge António
Thanks @LaurentMazare!
PS: I haven't tested this same command on CUDA devices yet.
After inspecting the code and removing the penalty, I realized there is also a considerable amount of time spent on ```rs let next_token = self.logits_processor.sample(&logits)?; ``` Regarding the asynchronicity of Metal...
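(As an aside, here is a minimal pure-Rust sketch of how one might time that per-token sampling step; the `expensive_sample` closure is just a hypothetical stand-in for `self.logits_processor.sample(&logits)?`:)

```rs
use std::time::{Duration, Instant};

fn main() {
    // Hypothetical stand-in for the per-token sampling call,
    // e.g. `self.logits_processor.sample(&logits)?`.
    let expensive_sample = || -> u32 { (0..1_000_000u64).sum::<u64>() as u32 };

    let n_tokens: u32 = 100;
    let mut total = Duration::ZERO;
    for _ in 0..n_tokens {
        let start = Instant::now();
        let _next_token = expensive_sample();
        total += start.elapsed();
    }
    println!("sampling: {total:?} total, {:?}/token", total / n_tokens);
}
```

Note that with an asynchronous backend like Metal, the cost of earlier kernels can surface in whichever later call forces a synchronization, so wall-clock timings around a single call should be read with care.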
PS: after running on an RTX 4090, I find inference on this model particularly fast, roughly 90 tokens/sec.
This is actually an interesting topic, thanks for sharing it @hugoabonizio. Even though numerical imprecision is naturally present across different implementations, I would expect these differences to be minimal...
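(As a small self-contained illustration of the imprecision in question: summing the same `f32` values in two different orders already gives slightly different results, and GPU reductions reorder floating-point operations in exactly this way.)

```rs
fn main() {
    // f32 addition is not associative, so the same values summed in a
    // different order can produce a slightly different result.
    let xs: Vec<f32> = (1..=100_000u32).map(|i| 1.0 / i as f32).collect();
    let forward: f32 = xs.iter().sum();
    let backward: f32 = xs.iter().rev().sum();
    println!("forward:  {forward}");
    println!("backward: {backward}");
    println!("abs diff: {}", (forward - backward).abs());
}
```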
Thank you @LaurentMazare! The issue was, I believe, that I was not converting the input tensor to a dtype other than `f32`. I refactored the code from ```rs for &token...
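(For context, a minimal candle sketch of the kind of conversion I mean; the tensor contents and variable names are illustrative, not the actual model code:)

```rs
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    // Illustrative floating-point activations; in the model these would
    // come from the embedding lookup rather than a literal.
    let xs = Tensor::new(&[[0.1f32, 0.2, 0.3]], &device)?;
    // Convert the activations to the dtype of the model weights.
    let xs_f16 = xs.to_dtype(DType::F16)?;
    println!("{:?}", xs_f16.dtype()); // F16
    Ok(())
}
```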
Thanks a lot for the PR! Unfortunately, I also have the same issue with other dtypes, including `f16`: `Candle error: Metal contiguous index_select F16 F16 not implemented`
I see, right. It seems, though, that many of these models do not support `f16` or `bf16`. When I avoid erroneously converting the indices to `f16`, I am getting this...
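(To make the distinction concrete, here is a minimal candle sketch, using a CPU device for portability and illustrative variable names: the values being gathered can be `f16`, but the index tensor has to stay an integer dtype such as `u32`; casting the indices to `f16` is what leads to the unimplemented `index_select F16 F16` kernel above.)

```rs
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu; // e.g. Device::new_metal(0)? on macOS
    // An f16 embedding table: 10 rows of 4 values each.
    let embeddings = Tensor::randn(0f32, 1.0, (10, 4), &device)?.to_dtype(DType::F16)?;
    // The indices stay u32; converting them to f16 is the mistake
    // that hits the missing Metal kernel.
    let ids = Tensor::new(&[1u32, 3, 5], &device)?;
    let rows = embeddings.index_select(&ids, 0)?;
    println!("{:?}", rows.shape()); // [3, 4]
    Ok(())
}
```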
This is interesting: on my MacBook Pro it works with `f16`, but not with `bf16`. Thanks for the PR @LaurentMazare, it would be great to have this for both...
Any updates on the current PR? @skrider