Philipp Schmid

136 comments by Philipp Schmid

`easyllm` uses the `huggingface_hub` library. I talked to @Wauplin: at the moment it is not possible to deactivate the cache when using the `InferenceClient`. A workaround would be if...

Another workaround could be to add a `seed` argument when sending the multiple requests; this should lead to non-cached outputs. @KoutchemeCharles could you try this? You would...
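A minimal sketch of the `seed` idea, assuming the cache keys on the full request payload so that a different `seed` produces a cache miss. The helper `generate_uncached` and the stand-in backend `fake_generate` are hypothetical names used only for illustration; they are not part of `easyllm` or `huggingface_hub`.

```python
import random
from typing import Callable, List


def generate_uncached(
    generate: Callable[..., str], prompt: str, n: int, base_seed: int = 0
) -> List[str]:
    """Call `generate` n times, passing a distinct `seed` each time.

    Assumption: the backend treats requests that differ only in `seed`
    as different requests, so none of them is served from the cache.
    """
    rng = random.Random(base_seed)
    # random.sample draws without replacement, so all n seeds are distinct.
    seeds = rng.sample(range(2**31), n)
    return [generate(prompt, seed=s) for s in seeds]


def fake_generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a real client call; echoes its inputs."""
    return f"{prompt} (seed={seed})"


outputs = generate_uncached(fake_generate, "Write a haiku", n=4)
```

With a real client, one could pass a callable that accepts `seed` as a keyword argument in place of `fake_generate`, for example `InferenceClient.text_generation` from `huggingface_hub`. Whether this actually bypasses the cache depends on the backend, so it is worth verifying on a small batch first.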

Seems to be a hardware and environment issue unrelated to the code. I used CUDA 11.8.

Did you make changes to the flash attention patch? The example only works with Falcon since it has a custom patch to use flash attention.

Yes! 👍🏻 I plan to update all my posts and remove those patches once there is an official release.

Can you share the code you use? Do you only want to do inference? What hardware do you have available?

Sure.

```bash
>> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
```

and ```bash...