setzer22
Same here :+1: What I would do is always enable this by default, and just have a flag to disable it.
As mentioned in the Discord conversation, the real challenge here is extending the context window beyond the current cap of 2048 tokens. But in the meantime, a chat application with...
Yup, I don't see any problems here (other than this just hasn't been implemented yet) :smile: This might require some careful handling of the underlying ggml context. Make sure a...
I'd say this is in-scope for the project, but I don't have enough time to tackle this unfortunately :sweat_smile: PRs welcome for anyone who wants to take on the task!
I'd say being able to infer beyond the EOT token is a feature some might want, even if it's just to run an experiment and see what happens. But I'm OK...
We already do! The alpaca-lora weights (converted to ggml format) are compatible with the implementation in this repo. If you go to our Discord server (see link in README) we...
> Namely the model context does not seem to be reset between requests

The best way to handle this is to create one new `InferenceSession` per request. An inference session...
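To illustrate the idea, here's a minimal sketch of the one-session-per-request pattern. The `Model` and `InferenceSession` types below are simplified stand-ins rather than this crate's real API: the point is just that the large, read-only model is shared, while each request gets a fresh session that owns its own context and is dropped when the request finishes.

```rust
/// Stand-in for the loaded model: heavy, read-only, shared by all requests.
struct Model {
    name: String,
}

/// Stand-in for per-request inference state (token history, KV cache, ...).
struct InferenceSession<'a> {
    model: &'a Model,
    tokens: Vec<u32>,
}

impl Model {
    /// Start a fresh session borrowing the shared model.
    fn start_session(&self) -> InferenceSession<'_> {
        InferenceSession { model: self, tokens: Vec::new() }
    }
}

impl<'a> InferenceSession<'a> {
    fn feed(&mut self, token: u32) {
        self.tokens.push(token);
    }
    fn context_len(&self) -> usize {
        self.tokens.len()
    }
}

/// Each request builds its own session, so no context leaks between requests.
fn handle_request(model: &Model, prompt: &[u32]) -> usize {
    let mut session = model.start_session();
    for &t in prompt {
        session.feed(t);
    }
    session.context_len()
    // `session` is dropped here, freeing the per-request context.
}

fn main() {
    let model = Model { name: "llama".into() };
    // Two requests; the second does not see the first one's tokens.
    assert_eq!(handle_request(&model, &[1, 2, 3]), 3);
    assert_eq!(handle_request(&model, &[4]), 1);
    println!("model {}: each request saw only its own context", model.name);
}
```

In the real crate the session additionally holds the ggml memory for the KV cache, which is exactly the state you want reset between requests.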
Alright, I made a first attempt, but couldn't manage to get it working. Here's what I tried:

1. Pulled the https://github.com/huggingface/transformers/ repository.
2. Installed torch using `pip install torch`
3. ...
Apart from my initial exploration, I also realized the `tokenizers` crate brings in [a ton of dependencies](https://github.com/huggingface/tokenizers/blob/main/tokenizers/Cargo.toml), and requires OpenSSL to be installed in order to build. I don't think all this is...
Hi @Narsil! Thanks a lot :) We are evaluating the best route to integrate this. I have a few questions if you don't mind:

- We are considering a...