Refactor the API
Currently it's a bit of a mess with little to no structure.
I'll be working on making things a bit more structured and extensible.
I'd love to see some separation here, or even the option to not run the model with this repo at all and instead just use the SvelteKit app + MongoDB with an API of our choice (the app looks fantastic, by the way).
For example, this project: https://github.com/oobabooga/text-generation-webui lets you run a number of models (including Llama / Alpaca) with optimizations like 8-bit and even the new GPTQ/4-bit inference, so it's possible to run 30B models using around 18GB of VRAM. It also has an API that lets you do generation without going through their Gradio interface.
I think it'd also let you iterate faster, as you wouldn't have to do so much work on running the model on all platforms and could instead focus on the web app.
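To make the idea a bit more concrete, here is a rough sketch of what that separation might look like on the app side. Everything in it is an assumption for illustration: `GenerationBackend`, `HttpBackend`, `EchoBackend`, the `{ prompt }` request body and the `text` response field are hypothetical names, not anything from this repo or from text-generation-webui's actual API.

```typescript
// Hypothetical sketch: none of these names exist in the repo.
// The web app only depends on this interface, not on how the model is run.
interface GenerationBackend {
  generate(prompt: string): Promise<string>;
}

// Minimal fetch-like signature, so a backend can be exercised without a network.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Talks to any external HTTP API. The request and response shapes
// ({ prompt } in, { text } out) are assumed, not a real server's contract.
class HttpBackend implements GenerationBackend {
  private readonly url: string;
  private readonly fetchFn: FetchLike;

  constructor(url: string, fetchFn: FetchLike) {
    this.url = url;
    this.fetchFn = fetchFn;
  }

  async generate(prompt: string): Promise<string> {
    const res = await this.fetchFn(this.url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });
    if (!res.ok) {
      throw new Error(`Backend returned HTTP ${res.status}`);
    }
    const data = (await res.json()) as { text?: unknown };
    if (typeof data.text !== "string") {
      throw new Error("Backend response is missing a string `text` field");
    }
    return data.text;
  }
}

// Stand-in backend for working on the UI without any model running.
class EchoBackend implements GenerationBackend {
  async generate(prompt: string): Promise<string> {
    return `echo: ${prompt}`;
  }
}
```

With something along these lines, the app could pick a backend from config (the model bundled with this repo, an external server, or a stub) and the rest of the code wouldn't need to care which one is in use.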