edsl
Add ability to use HuggingFace models
Replicate results from: https://github.com/socialfoundations/surveying-language-models
Hello
As open models are already supported in the cloud via Deep-Infra, all that is needed is a bit of new code in edsl/inference_services/llama-cpp.py. This could be:
- A hack: Recognize when the OpenAI proxy is pointing local (already possible via env vars) and llama-cpp is installed, then just reuse your existing edsl/inference_services/OpenAIService.py code (see the sketch after this list)
- Like-for-like: Code a connector like edsl/inference_services/DeepInfraService.py that calls on the llama-cpp instance if llama-cpp-python is installed
- Advantaged: A slightly more substantial client to take advantage of the local running, interfacing with llama-cpp directly instead of relying on the OpenAI proxy, similar to this implementation
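For the hack option, a rough sketch of what "pointing local" could look like once a llama-cpp server is running on its default port; the model alias and prompt below are placeholders, not anything EDSL currently defines:

```python
import os
from openai import OpenAI

# Point the existing OpenAI-based connector at the local llama-cpp server.
# These are the standard env vars the openai client reads; the URL assumes
# llama-cpp-python's default port (8000).
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "sk-no-key-needed"  # the local server ignores the key

client = OpenAI()
response = client.chat.completions.create(
    model="local-model",  # placeholder alias exposed by the local server
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```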
I see references in some of the existing connectors to JSON output being required, so y'all could look at Guidance, which directly supports all your current LLMs except Deep-Infra; that one is available via LiteLLM. But for now like-for-like appears easiest!
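On the JSON point specifically: llama-cpp-python also accepts an OpenAI-style response_format option that constrains generation to valid JSON via a grammar, at least in reasonably recent releases. A hedged sketch, with a placeholder model path:

```python
from llama_cpp import Llama

# Hedged sketch: ask for JSON-only output via the OpenAI-style response_format
# option (recent llama-cpp-python releases); the model path is a placeholder.
llm = Llama(model_path="models/your-model.Q5_K_M.gguf", n_ctx=4096, verbose=False)
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You answer only in JSON."},
        {"role": "user", "content": 'Reply with {"answer": "yes"} or {"answer": "no"}.'},
    ],
    response_format={"type": "json_object"},
)
print(result["choices"][0]["message"]["content"])
```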
@zer0dss as you're working on services, can you add this to your TODOs?
@johnjosephhorton Yes, I can add it. It is related to the other integrations (bedrock, azure).
Hello
I have taken note of some community-developed roleplay models that you folks may wish to test against once you have a working inference script:
- turboderp/llama3-turbcat-instruct-8b
- ClosedCharacter/Peach-9B-8k-Roleplay
- TheDrummer/Rocinante-12B-v1.1
- nothingiisreal/MN-12B-Celeste-V1.9
None of these are available in Ollama or Deep-Infra. Too specialist!
@zer0dss did you ever try any of these out?
@johnjosephhorton I've checked the models but HuggingFace doesn't support a serverless option for them. In order to use them you have to deploy the models on a dedicated server.
I don't think that's correct. In llama-cpp-python, one uses a GGUF quantized model and can easily call on a local CPU or GPU to start inference. Huggingface documentation typically refers to the unquantized versions, which do require extremely robust non-consumer hardware. Refer again to the code snippet I linked to earlier.
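For illustration, a minimal local-inference sketch, assuming a GGUF quant is already on disk; the path, context size, and prompt are placeholders:

```python
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers=-1 offloads all layers to a GPU
# if one is available, while 0 keeps inference entirely on the CPU.
llm = Llama(
    model_path="models/your-model.Q5_K_M.gguf",  # placeholder path
    n_ctx=8192,
    n_gpu_layers=-1,
)

# create_chat_completion returns an OpenAI-style response dict.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Stay in character and greet me."}]
)
print(result["choices"][0]["message"]["content"])
```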
@iwr-redmond What I was referring to is the availability of this option, which will allow us to call the model using the Hugging Face API key.
From Appendix A (p17) of the original paper (emphasis added):
> We downloaded the publicly available language model weights from their respective official HuggingFace repositories. We run the models in an internal cluster. The total number of GPU hours needed to complete all experiments is approximately 1500 (NVIDIA A100). The budget spent querying the OpenAI models was approximately $200.
The llama-cpp-python package makes this process far less resource intensive by using quantization; at around Q5_K_M, a quantized model is functionally identical to the original weights. All of the models noted, being below 13B parameters, can run easily on consumer hardware without needing the referenced internal cluster.
(edited: fixed quant statement)
@iwr-redmond The hardware resources are not the problem. The issue is that we will not know how to make API calls to these custom models. They are not running behind an inference service. If you build a wrapper that respects the ollama or vLLM endpoints, then the model can be used through the ollama service integrated inside edsl.
Ah, but that is not correct. To go through the options:
- Hack: the llama-cpp-python package can spin up its own OpenAI-compatible server, so just copy your deepinfra template for the default port (8000).
- Like-for-like: check for the llama-cpp-python package and if it exists spin up a multi-model configuration file that is preconfigured with whichever default you would like for EDSL.
- Advantaged: Use local inference by calling models directly; this could be done with llama-cpp-python's standard OpenAI-compatible response format when called as a function, or with an abstraction layer like easy-llama.
This starter code should be good enough for government work:
import os
import uvicorn

EDSL_LOCAL_DEFAULTS = (
    "# configuration file default"  # placeholder: path to a default multi-model config file
)

try:
    # llama-cpp-python ships an OpenAI-compatible FastAPI server
    from llama_cpp.server.app import create_app
    env_custom_models = os.environ.get("EDSL_LOCAL_CONFIG", EDSL_LOCAL_DEFAULTS)
    try:
        # recent llama-cpp-python releases read the multi-model config file
        # path from the CONFIG_FILE environment variable
        os.environ["CONFIG_FILE"] = env_custom_models
        app = create_app()
        # override the port to a standard default
        uvicorn.run(app, host="127.0.0.1", port=8000)
    except Exception:
        print("Unable to activate local model server.")
except ImportError:
    print("Local model support not installed.")
The huggingface-hub package could also be used to auto-download weights if helpful.
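For example, something along these lines could fetch a quantized file into the local cache before the server starts; the repo and filename are purely illustrative:

```python
from huggingface_hub import hf_hub_download

# Download a GGUF quant into the local Hugging Face cache and return its path;
# repo_id and filename are illustrative and would come from EDSL configuration.
model_path = hf_hub_download(
    repo_id="your-org/your-model-GGUF",
    filename="your-model.Q5_K_M.gguf",
)
print(model_path)  # hand this path to llama-cpp-python or the server config
```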