
Add ability to use HuggingFace models

[Open] johnjosephhorton opened this issue 10 months ago • 12 comments

Replicate results from: https://github.com/socialfoundations/surveying-language-models

johnjosephhorton avatar Apr 13 '24 14:04 johnjosephhorton

Hello

As open models are already supported in the cloud via Deep-Infra, all that is needed is a bit of new code in edsl/inference_services/llama-cpp.py. This could be:

  1. A hack: Recognize when the OpenAI proxy is pointing at a local endpoint (already possible via env vars) and llama-cpp is installed, then just reuse your existing edsl/inference_services/OpenAIService.py code
  2. Like-for-like: Code a connector like edsl/inference_services/DeepInfraService.py that calls on the llama-cpp instance if llama-cpp-python is installed
  3. Advantaged: A slightly more substantial client that takes advantage of running locally, interfacing with llama-cpp directly instead of relying on the OpenAI proxy, similar to this implementation

I see references in some of the existing connectors to JSON output being required, so y'all could look at Guidance, which directly supports all your current LLMs except Deep-Infra; that one is available via LiteLLM. But for now, like-for-like appears easiest!
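
To illustrate option 1, here is a minimal sketch, assuming a llama-cpp-python server is already running locally on port 8000 (e.g. started with python -m llama_cpp.server --model path/to/model.gguf) and the openai client package is installed. The model alias and the JSON mode are placeholders that depend on how that local server is configured:

# Sketch of option 1: reuse an OpenAI-style client against a local
# llama-cpp-python server that exposes an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local OpenAI-compatible endpoint
    api_key="not-needed",                 # the local server ignores the key
)

response = client.chat.completions.create(
    model="local-model",  # hypothetical alias; must match the server's loaded model
    messages=[{"role": "user", "content": "Answer in JSON: is water wet?"}],
    response_format={"type": "json_object"},  # JSON mode, if the local server supports it
)
print(response.choices[0].message.content)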

iwr-redmond avatar Jul 11 '24 11:07 iwr-redmond

@zer0dss as you're working on services, can you add this to your TODOs?

johnjosephhorton avatar Aug 15 '24 11:08 johnjosephhorton

@johnjosephhorton Yes, I can add it. It is related to the other integrations (Bedrock, Azure).

zer0dss avatar Aug 15 '24 13:08 zer0dss

Hello

I have taken note of some community-developed roleplay models that you folks may wish to test against once you have a working inference script:

  • turboderp/llama3-turbcat-instruct-8b
  • ClosedCharacter/Peach-9B-8k-Roleplay
  • TheDrummer/Rocinante-12B-v1.1
  • nothingiisreal/MN-12B-Celeste-V1.9

None of these are available in Ollama or Deep-Infra. Too specialist!

iwr-redmond avatar Sep 03 '24 15:09 iwr-redmond

@zer0dss did you ever try any of these out?

johnjosephhorton avatar Oct 03 '24 10:10 johnjosephhorton

@johnjosephhorton I've checked the models, but HuggingFace doesn't support a serverless option for them. To use them, you would have to deploy the models on a dedicated server.

zer0dss avatar Oct 03 '24 10:10 zer0dss

I don't think that's correct. In llama-cpp-python, one uses a GGUF-quantized model and can easily call on a local CPU or GPU to start inference. The HuggingFace documentation typically refers to the unquantized versions, which do require extremely robust non-consumer hardware. Refer again to the code snippet I linked earlier.
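
For concreteness, a minimal sketch of that kind of local call, assuming llama-cpp-python is installed and a quantized GGUF file has already been downloaded (the path below is a placeholder):

# Minimal sketch of local GGUF inference with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Rocinante-12B-v1.1-Q5_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available; 0 = CPU only
    n_ctx=4096,       # context window
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Do you vote in elections?"}],
    max_tokens=128,
)
# The return value follows the OpenAI chat-completion schema.
print(result["choices"][0]["message"]["content"])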

iwr-redmond avatar Oct 04 '24 15:10 iwr-redmond

@iwr-redmond What I was referring to is the availability of this option, which will allow us to call the model using the Hugging Face API key. (screenshot: Hugging Face serverless inference availability, 2024-10-04)

zer0dss avatar Oct 04 '24 15:10 zer0dss

From Appendix A (p17) of the original paper (emphasis added):

We downloaded the publicly available language model weights from their respective official HuggingFace repositories. We run the models in an internal cluster. The total number of GPU hours needed to complete all experiments is approximately 1500 (NVIDIA A100). The budget spent querying the OpenAI models was approximately $200.

The llama-cpp-python package makes this process far less resource intensive by using quantization, and a quantized model at about Q5_K_M is functionally identical to the original weights. All of the models noted, being below 13B parameters, can run easily on consumer hardware without needing the referenced internal cluster.

(edited: fixed quant statement)

iwr-redmond avatar Oct 04 '24 15:10 iwr-redmond

@iwr-redmond The hardware resources are not the problem. The issue is that we will not know how to make API calls to these custom models; they are not running behind an inference service. If you build a wrapper that exposes the Ollama or vLLM endpoints, then the model can be used via the Ollama service integrated inside edsl.

zer0dss avatar Oct 04 '24 16:10 zer0dss

Ah, but that is not correct. To go through the options:

  1. Hack: the llama-cpp-python package can spin up its own OpenAI-compatible server, so just copy your DeepInfra template and point it at the default port (8000).
  2. Like-for-like: check for the llama-cpp-python package and, if it is present, spin up its server with a multi-model configuration file preconfigured with whichever defaults you would like for EDSL (see the config sketch below).
  3. Use local inference by calling models directly; this could be done with llama-cpp-python's OpenAI-compatible response format when it is called as a function, or via an abstraction layer like easy-llama.
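
To make option 2 concrete, here is a rough sketch of generating such a multi-model configuration file from Python, assuming a recent llama-cpp-python whose server accepts --config_file; the field names follow its server settings (models, model, model_alias), and the model paths and aliases are placeholders:

# Sketch: write a multi-model config file that llama-cpp-python's server
# (python -m llama_cpp.server --config_file ...) can consume.
# The model paths and aliases below are placeholders, not real defaults.
import json

edsl_local_config = {
    "host": "127.0.0.1",
    "port": 8000,
    "models": [
        {"model": "models/Peach-9B-8k-Roleplay-Q5_K_M.gguf", "model_alias": "peach-9b"},
        {"model": "models/Rocinante-12B-v1.1-Q5_K_M.gguf", "model_alias": "rocinante-12b"},
    ],
}

with open("edsl_local_models.json", "w") as f:
    json.dump(edsl_local_config, f, indent=2)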

iwr-redmond avatar Oct 04 '24 16:10 iwr-redmond

This starter code should be good enough for government work:

import os

EDSL_LOCAL_DEFAULTS = "edsl_local_models.json"  # default multi-model configuration file

try:
    # llama-cpp-python's [server] extra ships an OpenAI-compatible FastAPI app
    import uvicorn
    from llama_cpp.server.app import create_app
    from llama_cpp.server.settings import ConfigFileSettings
except ImportError:
    print("Local model support not installed.")
else:
    config_path = os.environ.get("EDSL_LOCAL_CONFIG", EDSL_LOCAL_DEFAULTS)
    try:
        with open(config_path) as f:
            config = ConfigFileSettings.model_validate_json(f.read())
        app = create_app(server_settings=config, model_settings=config.models)
        uvicorn.run(app, host="127.0.0.1", port=8000)  # override the port to a standard default
    except Exception as exc:
        print(f"Unable to activate local model server: {exc}")

The huggingface-hub package could also be used to auto-download weights if helpful.
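
For example, a small sketch using huggingface_hub's hf_hub_download; the repository id and filename are placeholders, since the GGUF mirrors for the models above would need to be located first:

# Sketch: fetch a quantized GGUF from the Hugging Face Hub before starting
# the local server. Repo id and filename are placeholders, not verified repos.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="some-user/Rocinante-12B-v1.1-GGUF",  # hypothetical GGUF mirror
    filename="Rocinante-12B-v1.1-Q5_K_M.gguf",    # hypothetical quant filename
)
print(f"Model downloaded to {local_path}")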

iwr-redmond avatar Oct 04 '24 16:10 iwr-redmond