Perplexica
Feature request: Authentication and multiple user support?
Dear Perplexica team:
Excellent work. Is there a plan to add authentication and multiple-user support so that the deployment can be used by different people with their own API keys?
Thanks!
@poisson-sg You may run the service behind Authelia, which will add an auth layer (that's what I do), but this still won't solve your multi-user/session request, which would be of great value. On the other hand, given that GPU resources will likely be limited, I would recommend reducing the model choice to avoid GPU OOM situations.
@nirabo Can you go into how you integrated with Authelia?
@fobtastic

> @nirabo Can you go into how you integrated with Authelia?

I followed the nginx-proxy-manager + Authelia integration video here (https://www.youtube.com/watch?v=4UKOh3ssQSU) and started both in one Docker composition. Perplexica was running in its default Docker composition (as per the repo), alongside SearXNG etc. I then created a new proxy host in the proxy-manager dashboard following the instructions from the video, and it all worked pretty nicely. Give it a try and let me know if you hit any snags.
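For anyone trying to reproduce this, a minimal sketch of the "both in one Docker composition" part might look like the snippet below. The service names, volume paths, and image tags are illustrative assumptions, not the exact setup from the video; Perplexica itself stays in its own compose project, and you point a proxy host at it from the nginx-proxy-manager dashboard with Authelia attached as forward auth.

```yaml
# Hypothetical compose file: names, ports, and paths are examples only.
services:
  npm:
    image: jc21/nginx-proxy-manager:latest
    ports:
      - "80:80"
      - "443:443"
      - "81:81"          # admin dashboard
    volumes:
      - ./npm/data:/data
      - ./npm/letsencrypt:/etc/letsencrypt
  authelia:
    image: authelia/authelia:latest
    volumes:
      - ./authelia:/config   # configuration.yml and users database live here
    expose:
      - "9091"
```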
User management like in open-webui would be great.
User management is not the same as authentication: you still expose the API keys and share the same chats among all users. Having proper multi-user support would be awesome.
> On the other hand, given that GPU resources will likely be limited, I would recommend to reduce the model choice to avoid GPU OOM situations.
Providers that implement continuous batching, like vLLM or SGLang, can easily batch dozens of queries (the default in vLLM is 256) with very small overhead in KV-cache size.
I have seen Ollama load one model per connection sometimes, but in practice there is no need for that: batching basically upgrades a vector of, say, 100K tokens (2 bytes per value at fp16, 200 KB) to a matrix of batch-size × 100K (4 MB for 20 batched queries), and replaces all matrix-vector multiplications with matrix-matrix multiplications.
It's also a much better way to fully utilize the GPU, because matrix-vector multiplication is memory-bound while matrix-matrix multiplication is compute-bound.
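The arithmetic above can be sketched in a few lines of Python. The token count, batch size, and 2-bytes-per-value fp16 assumption come straight from the comment; the hidden size used for the matmul shapes is purely illustrative.

```python
import numpy as np

BYTES_FP16 = 2
tokens = 100_000
batch = 20

# Activation memory: one query is a vector, a batch is a matrix.
vector_bytes = tokens * BYTES_FP16   # 200 KB for a single query
matrix_bytes = batch * vector_bytes  # 4 MB for 20 batched queries

# Batching turns a matrix-vector product (memory-bound) into a
# matrix-matrix product (compute-bound) against the same weight matrix.
hidden = 512  # illustrative hidden size, not from the thread
W = np.random.rand(hidden, hidden).astype(np.float16)
x = np.random.rand(hidden).astype(np.float16)         # single query
X = np.random.rand(batch, hidden).astype(np.float16)  # batched queries

y = W @ x    # matrix-vector: shape (hidden,)
Y = X @ W.T  # matrix-matrix: shape (batch, hidden)
```

The weights are read from memory once either way, so the batched matmul amortizes that memory traffic over the whole batch.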
+1 here. I'll set this up inside my own network just for myself and see how it goes, but it would be amazing to be able to give each user their own instance using something like authentik.