Jorge António
Thanks @LaurentMazare!
PS: I haven't tested this same command on CUDA devices yet.
After inspecting the code and removing the penalty, I realized there is also a considerable amount of time spent on ```rs let next_token = self.logits_processor.sample(&logits)?; ``` Regarding the asynchronicity of Metal...
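(As an aside, here is a minimal pure-Rust sketch of how one might time that per-token sampling step; the `expensive_sample` closure is just a hypothetical stand-in for `self.logits_processor.sample(&logits)?`:)

```rs
use std::time::{Duration, Instant};

fn main() {
    // Hypothetical stand-in for the per-token sampling call,
    // e.g. `self.logits_processor.sample(&logits)?`.
    let expensive_sample = || -> u32 { (0..1_000_000u64).sum::<u64>() as u32 };

    let n_tokens: u32 = 100;
    let mut total = Duration::ZERO;
    for _ in 0..n_tokens {
        let start = Instant::now();
        let _next_token = expensive_sample();
        total += start.elapsed();
    }
    println!("sampling: {total:?} total, {:?}/token", total / n_tokens);
}
```

Note that with an asynchronous backend like Metal, the cost of earlier kernels can surface in whichever later call forces a synchronization, so wall-clock timings around a single call should be read with care.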
PS: after running on an RTX 4090, I find inference on this model particularly fast, roughly 90 tokens/sec.
This is actually an interesting topic, thanks for sharing it @hugoabonizio. Even though numerical imprecision is naturally present across different implementations, I would expect these differences to be minimal...
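(As a small self-contained illustration of the imprecision in question: summing the same `f32` values in two different orders already gives slightly different results, and GPU reductions reorder floating-point operations in exactly this way.)

```rs
fn main() {
    // f32 addition is not associative, so the same values summed in a
    // different order can produce a slightly different result.
    let xs: Vec<f32> = (1..=100_000u32).map(|i| 1.0 / i as f32).collect();
    let forward: f32 = xs.iter().sum();
    let backward: f32 = xs.iter().rev().sum();
    println!("forward:  {forward}");
    println!("backward: {backward}");
    println!("abs diff: {}", (forward - backward).abs());
}
```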
Thank you @LaurentMazare! The issue was, I believe, that I was not converting the input tensor to a dtype other than `f32`. I refactored the code from ```rs for &token...
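(For context, a minimal candle sketch of the kind of conversion I mean; the tensor contents and variable names are illustrative, not the actual model code:)

```rs
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    // Illustrative floating-point activations; in the model these would
    // come from the embedding lookup rather than a literal.
    let xs = Tensor::new(&[[0.1f32, 0.2, 0.3]], &device)?;
    // Convert the activations to the dtype of the model weights.
    let xs_f16 = xs.to_dtype(DType::F16)?;
    println!("{:?}", xs_f16.dtype()); // F16
    Ok(())
}
```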
Thanks a lot for the PR! Unfortunately, I also have the same issue with other dtypes, including `f16`: `Candle error: Metal contiguous index_select F16 F16 not implemented`
I see, right. It seems, though, that many of these models do not support `f16` or `bf16`. When I avoid erroneously converting the indices to `f16`, I am getting this...
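(To make the distinction concrete, here is a minimal candle sketch, using a CPU device for portability and illustrative variable names: the values being gathered can be `f16`, but the index tensor has to stay an integer dtype such as `u32`; casting the indices to `f16` is what leads to the unimplemented `index_select F16 F16` kernel above.)

```rs
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu; // e.g. Device::new_metal(0)? on macOS
    // An f16 embedding table: 10 rows of 4 values each.
    let embeddings = Tensor::randn(0f32, 1.0, (10, 4), &device)?.to_dtype(DType::F16)?;
    // The indices stay u32; converting them to f16 is the mistake
    // that hits the missing Metal kernel.
    let ids = Tensor::new(&[1u32, 3, 5], &device)?;
    let rows = embeddings.index_select(&ids, 0)?;
    println!("{:?}", rows.shape()); // [3, 4]
    Ok(())
}
```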
This is interesting: on my MacBook Pro it works with `f16`, but not with `bf16`. Thanks for the PR @LaurentMazare, it would be great to have this for both...
Any updates on the current PR? @skrider