
support local GPU server without runhouse !!!


Feature request

Hi guys, can we launch SelfHostedEmbeddings without Runhouse? Runhouse itself is still at an early stage of development and has some issues, so why does LangChain depend on it for self-hosted models instead of calling CUDA directly?

for example:

device = torch.device("cuda")

embeddings = SelfHostedEmbeddings(
    model_load_fn=get_pipeline,
    hardware=device,
    model_reqs=["./", "torch", "transformers"],
    inference_fn=inference_fn,
)
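
For comparison, a narrower path that already runs embeddings fully in-process on a local GPU without Runhouse is HuggingFaceEmbeddings, which wraps sentence-transformers; a minimal sketch, where the model name is just an example and the device kwarg assumes the sentence-transformers backend:

from langchain.embeddings import HuggingFaceEmbeddings

# Sketch only: everything runs in the current process; sentence-transformers
# places the model on the requested device, so no Runhouse cluster is involved.
local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # example model
    model_kwargs={"device": "cuda"},
)

vectors = local_embeddings.embed_documents(["hello world"])

This does not cover a custom model_load_fn or inference_fn, which is what SelfHostedEmbeddings would provide.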

Motivation

use GPU local server directly instead of through runhouse.

Your contribution

htang2012, Jun 04 '23 15:06

If by local you mean directly on your machine, there is SelfHostedHuggingFaceLLM. If you mean on a local server, maybe PR #2830 solves your problem. (Feel free to review and contribute.)

Xmaster6y, Jun 07 '23 09:06

Hi,

I am trying to do a similar thing but am still getting an error.

from langchain.llms import SelfHostedHuggingFaceLLM
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import runhouse as rh


def get_pipeline(model_id="decapoda-research/llama-7b-hf"):
    # Load the tokenizer and model, then wrap them in a text-generation pipeline.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer
    )
    return pipe

# Runhouse cluster the model is supposed to run on.
gpu = rh.cluster(name="rh-a10x", instance_type="A100:1")

hf = SelfHostedHuggingFaceLLM(
    model_load_fn=get_pipeline, hardware=gpu)

The setup is that I have reserved a GPU, and that is where I want to run this code. But I get the following error:

ValueError: Cluster must have an ip address (i.e. be up) or have a reup_cluster method (e.g. OnDemandCluster).
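
From the LangChain Runhouse example notebook, it looks like there is a second form of rh.cluster for a machine that is already up, where you pass the box's IP and SSH credentials instead of an instance_type. I have not verified it here, but a rough sketch would be (host, user, and key path are placeholders):

# Sketch only: point Runhouse at an existing, reachable machine instead of
# requesting an on-demand instance. IP, user, and key path are placeholders.
gpu = rh.cluster(
    name="rh-a10x",
    ips=["<ip-of-the-reserved-gpu-box>"],
    ssh_creds={"ssh_user": "<user>", "ssh_private_key": "<path-to-key>"},
)

hf = SelfHostedHuggingFaceLLM(model_load_fn=get_pipeline, hardware=gpu)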

I just want to use the local GPU to load the model and get a response from it. This exercise is to help me understand how it works.

If you can provide some pointers, that would be really great!

thanks, Onkar

oapandit, Jul 07 '23 12:07

If by local you mean directly on your machine, there is SelfHostedHuggingFaceLLM. If you mean on a local server, maybe PR #2830 solves your problem. (Feel free to review, contribute)

This is not exactly what I need. I just need access to the local GPU/HPU accelerator to run model inference, not through any webserver or inference server such as text-generation-inference.
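
Something along these lines is all that is being asked for: load the model in-process and run it on the local accelerator, with no cluster or server in between. A minimal sketch; the model id is just an example reused from above:

import torch
from transformers import pipeline

# Sketch only: run generation directly on the local GPU, no Runhouse, no server.
device = 0 if torch.cuda.is_available() else -1  # pipeline device index
generator = pipeline(
    "text-generation",
    model="decapoda-research/llama-7b-hf",  # example model id
    device=device,
)

print(generator("What is the capital of France?", max_new_tokens=32)[0]["generated_text"])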

htang2012, Jul 27 '23 22:07

(quoting oapandit's comment above)

I believe this is a big defect in LangChain, and no one is taking care of it.

Setting up Runhouse and getting it to work is a recurring nightmare for me. I have spent almost 90% of my effort on setting up the Runhouse server and client and making them work smoothly, while getting the LangChain code to run on our accelerator took less than 10% of my time.

Hi guys: stop binding to Runhouse and provide direct access to the local HPU/GPU accelerator!

htang2012, Jul 27 '23 22:07

+1, did you figure out how to run embeddings locally using Runhouse?

mantrakp04, Aug 28 '23 01:08

Agreed, this is unnecessary library lock-in by design. I'm sure Runhouse is fine for some, but I have my own Kubernetes cluster with GPU nodes and would like to use LangChain services and various endpoints in addition to the CPU/Docker environment it's running in. Right now, I may have to roll my own Embeddings instance, using HF or whatnot, to access a local GPU in Docker. We've done this in the past in Haystack, but I'm not really planning on going back to that.
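
Rolling your own is not much code. A rough sketch of a local-GPU Embeddings class, assuming LangChain's Embeddings base interface (embed_documents/embed_query) and using sentence-transformers; the model name is just an example:

from typing import List

from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer


class LocalGPUEmbeddings(Embeddings):
    """Sketch of an Embeddings implementation that runs on a local CUDA device."""

    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        # Loads the model in-process and places it on the local GPU.
        self.model = SentenceTransformer(model_name, device="cuda")

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self.model.encode(texts).tolist()

    def embed_query(self, text: str) -> List[float]:
        return self.model.encode([text])[0].tolist()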

jordanburke, Sep 06 '23 16:09

Hi, @htang2012

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the issue you raised requests the ability to launch SelfHostedEmbedding without runhouse and instead call cuda directly for local GPU server support. There was a suggestion to use SelfHostedHuggingFaceLLM for local machine support, but you clarified the need for direct access to the local accelerator GPU/HPU without any webserver or inference server. Other users also expressed similar sentiments, highlighting the desire to use LangChain services and endpoints with local GPU support.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

dosubot[bot], Dec 06 '23 17:12