
Add ability to use HuggingFace models

[Open] johnjosephhorton opened this issue 10 months ago • 12 comments

Replicate results from: https://github.com/socialfoundations/surveying-language-models

johnjosephhorton avatar Apr 13 '24 14:04 johnjosephhorton

Hello

As open models are already supported in the cloud via Deep-Infra, all that is needed is a bit of new code in edsl/inference_services/llama-cpp.py. This could be:

  1. A hack: Recognize when the OpenAI proxy is pointing at a local endpoint (already possible via env vars) and llama-cpp is installed, then just reuse your existing edsl/inference_services/OpenAIService.py code
  2. Like-for-like: Code a connector like edsl/inference_services/DeepInfraService.py that calls on the llama-cpp instance if llama-cpp-python is installed
  3. Advantaged: A slightly more substantial client that takes advantage of running locally, interfacing with llama-cpp directly instead of relying on the OpenAI proxy, similar to this implementation

I see references in some of the existing connectors to JSON output being required, so y'all could look at Guidance, which directly supports all your current LLMs except Deep-Infra; that one is available via LiteLLM. But for now, like-for-like appears easiest!
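
To illustrate option 1, here is a minimal sketch, assuming a llama-cpp-python server is already running locally on port 8000 (e.g. started with python -m llama_cpp.server --model path/to/model.gguf) and the openai client package is installed. The model alias and the JSON mode are placeholders that depend on how that local server is configured:

# Sketch of option 1: reuse an OpenAI-style client against a local
# llama-cpp-python server that exposes an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local OpenAI-compatible endpoint
    api_key="not-needed",                 # the local server ignores the key
)

response = client.chat.completions.create(
    model="local-model",  # hypothetical alias; must match the server's loaded model
    messages=[{"role": "user", "content": "Answer in JSON: is water wet?"}],
    response_format={"type": "json_object"},  # JSON mode, if the local server supports it
)
print(response.choices[0].message.content)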

iwr-redmond avatar Jul 11 '24 11:07 iwr-redmond

@zer0dss as you're working on services, can you add this to your TODOs?

johnjosephhorton avatar Aug 15 '24 11:08 johnjosephhorton

@johnjosephhorton Yes, I can add it. It is related to the other integrations (Bedrock, Azure).

zer0dss avatar Aug 15 '24 13:08 zer0dss

Hello

I have taken note of some community-developed roleplay models that you folks may wish to test against once you have a working inference script:

  • turboderp/llama3-turbcat-instruct-8b
  • ClosedCharacter/Peach-9B-8k-Roleplay
  • TheDrummer/Rocinante-12B-v1.1
  • nothingiisreal/MN-12B-Celeste-V1.9

None of these are available in Ollama or Deep-Infra. Too specialist!

iwr-redmond avatar Sep 03 '24 15:09 iwr-redmond

@zer0dss did you ever try any of these out?

johnjosephhorton avatar Oct 03 '24 10:10 johnjosephhorton

@johnjosephhorton I've checked the models, but HuggingFace doesn't support a serverless option for them. To use them, you would have to deploy the models on a dedicated server.

zer0dss avatar Oct 03 '24 10:10 zer0dss

I don't think that's correct. In llama-cpp-python, one uses a GGUF-quantized model and can easily call on a local CPU or GPU to start inference. The HuggingFace documentation typically refers to the unquantized versions, which do require extremely robust non-consumer hardware. Refer again to the code snippet I linked earlier.
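
For concreteness, a minimal sketch of that kind of local call, assuming llama-cpp-python is installed and a quantized GGUF file has already been downloaded (the path below is a placeholder):

# Minimal sketch of local GGUF inference with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Rocinante-12B-v1.1-Q5_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available; 0 = CPU only
    n_ctx=4096,       # context window
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Do you vote in elections?"}],
    max_tokens=128,
)
# The return value follows the OpenAI chat-completion schema.
print(result["choices"][0]["message"]["content"])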

iwr-redmond avatar Oct 04 '24 15:10 iwr-redmond

@iwr-redmond What I was referring to is the availability of this option, which will allow us to call the model using the Hugging Face API key. (screenshot: Hugging Face serverless inference availability, 2024-10-04)

zer0dss avatar Oct 04 '24 15:10 zer0dss

From Appendix A (p17) of the original paper (emphasis added):

We downloaded the publicly available language model weights from their respective official HuggingFace repositories. We run the models in an internal cluster. The total number of GPU hours needed to complete all experiments is approximately 1500 (NVIDIA A100). The budget spent querying the OpenAI models was approximately $200.

The llama-cpp-python package makes this process far less resource intensive by using quantization, and a quantized model at about Q5_K_M is functionally identical to the original weights. All of the models noted, being below 13B parameters, can run easily on consumer hardware without needing the referenced internal cluster.

(edited: fixed quant statement)

iwr-redmond avatar Oct 04 '24 15:10 iwr-redmond

@iwr-redmond The hardware resources are not the problem. The issue is that we will not know how to make API calls to these custom models; they are not running behind an inference service. If you build a wrapper that exposes the Ollama or vLLM endpoints, then the model can be used via the Ollama service integrated inside edsl.

zer0dss avatar Oct 04 '24 16:10 zer0dss

Ah, but that is not correct. To go through the options:

  1. Hack: the llama-cpp-python package can spin up its own OpenAI-compatible server, so just copy your DeepInfra template and point it at the default port (8000).
  2. Like-for-like: check for the llama-cpp-python package and, if it is present, spin up its server with a multi-model configuration file preconfigured with whichever defaults you would like for EDSL (see the config sketch below).
  3. Use local inference by calling models directly; this could be done with llama-cpp-python's OpenAI-compatible response format when it is called as a function, or via an abstraction layer like easy-llama.
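
To make option 2 concrete, here is a rough sketch of generating such a multi-model configuration file from Python, assuming a recent llama-cpp-python whose server accepts --config_file; the field names follow its server settings (models, model, model_alias), and the model paths and aliases are placeholders:

# Sketch: write a multi-model config file that llama-cpp-python's server
# (python -m llama_cpp.server --config_file ...) can consume.
# The model paths and aliases below are placeholders, not real defaults.
import json

edsl_local_config = {
    "host": "127.0.0.1",
    "port": 8000,
    "models": [
        {"model": "models/Peach-9B-8k-Roleplay-Q5_K_M.gguf", "model_alias": "peach-9b"},
        {"model": "models/Rocinante-12B-v1.1-Q5_K_M.gguf", "model_alias": "rocinante-12b"},
    ],
}

with open("edsl_local_models.json", "w") as f:
    json.dump(edsl_local_config, f, indent=2)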

iwr-redmond avatar Oct 04 '24 16:10 iwr-redmond

This starter code should be good enough for government work:

import os

EDSL_LOCAL_DEFAULTS = "edsl_local_models.json"  # default multi-model configuration file

try:
    # llama-cpp-python's [server] extra ships an OpenAI-compatible FastAPI app
    import uvicorn
    from llama_cpp.server.app import create_app
    from llama_cpp.server.settings import ConfigFileSettings
except ImportError:
    print("Local model support not installed.")
else:
    config_path = os.environ.get("EDSL_LOCAL_CONFIG", EDSL_LOCAL_DEFAULTS)
    try:
        with open(config_path) as f:
            config = ConfigFileSettings.model_validate_json(f.read())
        app = create_app(server_settings=config, model_settings=config.models)
        uvicorn.run(app, host="127.0.0.1", port=8000)  # override the port to a standard default
    except Exception as exc:
        print(f"Unable to activate local model server: {exc}")

The huggingface-hub package could also be used to auto-download weights if helpful.
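
For example, a small sketch using huggingface_hub's hf_hub_download; the repository id and filename are placeholders, since the GGUF mirrors for the models above would need to be located first:

# Sketch: fetch a quantized GGUF from the Hugging Face Hub before starting
# the local server. Repo id and filename are placeholders, not verified repos.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="some-user/Rocinante-12B-v1.1-GGUF",  # hypothetical GGUF mirror
    filename="Rocinante-12B-v1.1-Q5_K_M.gguf",    # hypothetical quant filename
)
print(f"Model downloaded to {local_path}")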

iwr-redmond avatar Oct 04 '24 16:10 iwr-redmond