Self-hosted models on remote clusters, including on-demand from AWS, GCP, Azure, Lambda
New modules that make it easy to run embedding and LLM models on one's own cloud GPUs. Uses Runhouse for the cloud RPC layer. Supports AWS, GCP, Azure, and Lambda today (auto-launching), plus BYO hardware by IP and SSH creds (e.g. for on-prem or other clouds like CoreWeave, Paperspace, etc.).
**APIs**

The API mirrors `HuggingFaceEmbeddings` and `HuggingFaceInstructEmbeddings`, but accepts an additional `hardware` parameter:
```python
from langchain.embeddings import SelfHostedHuggingFaceEmbeddings, SelfHostedHuggingFaceInstructEmbeddings
import runhouse as rh

gpu = rh.cluster(name="rh-a10x", instance_type="A100:1")

hf = SelfHostedHuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", hardware=gpu)

# Will run on the same GPU
hf_instruct = SelfHostedHuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large", hardware=gpu)
```
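Usage then looks like any other LangChain embedding (a quick sketch, assuming the standard `Embeddings` interface; the calls run remotely on the cluster):

```python
# Illustrative usage; embed_query/embed_documents are the standard Embeddings methods.
query_vec = hf.embed_query("What is self-hosted inference?")
doc_vecs = hf_instruct.embed_documents(["Self-hosted models run on your own cloud GPUs."])
```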
The `rh.cluster` above will launch the A100 on GCP, Azure, or Lambda, whichever is enabled and cheapest (thanks to SkyPilot). You can pin a specific provider with `provider='gcp'`, and also set `use_spot`, `region`, `image_id`, and `autostop_mins`, as in the sketch below.
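For example (a minimal sketch; the specific region and autostop values here are just placeholders):

```python
import runhouse as rh

# Illustrative values only: force GCP, request a spot instance, auto-stop when idle.
gpu = rh.cluster(
    name="rh-a10x",
    instance_type="A100:1",
    provider="gcp",
    use_spot=True,
    region="us-west1",   # placeholder region
    autostop_mins=60,    # placeholder idle timeout
)
```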
For AWS, you'd just switch the instance type to `"A10G:1"`. For a BYO cluster, you can do:
```python
gpu = rh.cluster(
    ips=["<ip of the cluster>"],
    ssh_creds={"ssh_user": "...", "ssh_private_key": "<path_to_key>"},
    name="rh-a10x",
)
```
**Design**

All we're doing here is sending a predefined inference function to the cluster through Runhouse, which brings up the cluster if needed, installs the dependencies, and returns a callable that sends requests to run the function over gRPC. The function takes the model_id as an input, but the model is cached on the cluster so it only needs to be downloaded once. We can improve performance further pretty easily by pinning the model to GPU memory on the cluster; let me know if that's of interest.
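To make the model_id-based interface concrete, here's a rough sketch of the LLM counterpart (parameter names mirror the existing HuggingFacePipeline API and are assumptions here, not the final signature):

```python
from langchain.llms import SelfHostedHuggingFaceLLM
import runhouse as rh

gpu = rh.cluster(name="rh-a10x", instance_type="A100:1")

# Assumed interface: model_id + task + hardware, mirroring HuggingFacePipeline.
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", task="text-generation", hardware=gpu)
print(llm("What is the capital of France?"))
```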
**Testing**

Added new tests in embeddings/test_self_hosted.py (which mirrors test_huggingface.py) and llms/test_self_hosted_llm.py. Tests all pass on Lambda Labs (which is surprising, because the first two test_huggingface.py tests are supposedly segfaulting?). We can pin the provider used in the tests to whichever your CI uses, or you can choose to only run these on a schedule to avoid spinning up a GPU (which can take ~5 minutes including installations).
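For reference, a minimal sketch of what one of these tests looks like (not the actual test body; the cluster config and dimension check are assumptions):

```python
import runhouse as rh
from langchain.embeddings import SelfHostedHuggingFaceEmbeddings


def test_self_hosted_huggingface_embedding_query() -> None:
    """Smoke-test querying a self-hosted embedding model on a remote GPU."""
    gpu = rh.cluster(name="rh-a10x", instance_type="A100:1")
    embedding = SelfHostedHuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2", hardware=gpu
    )
    output = embedding.embed_query("foo bar")
    assert len(output) == 768  # all-mpnet-base-v2 produces 768-dim vectors
```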
- [x] Introduce SelfHostedPipeline and SelfHostedHuggingFaceLLM
- [x] Introduce SelfHostedEmbeddings, SelfHostedHuggingFaceEmbeddings, and SelfHostedHuggingFaceInstructEmbeddings
- [x] Add tutorials for Self-hosted LLMs and Embeddings
- [x] Implement chat-your-data tutorial with Self-hosted models - https://github.com/dongreenberg/chat-your-data
Something strange happened when I merged in the upstream branch, but we're all good. Tests all pass, and I've added those tutorials!
Oh, I didn't realize I needed to update the poetry lock file; doing that now.