
HuggingFaceModel using direct generate/decode calls

Open seanaedmiston opened this issue 2 years ago • 6 comments

Provides greater control over generation and passes lists of prompts rather than single strings to HF Transformers, for better GPU utilization (2x in my case) when running models locally.
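
For context, here is a minimal sketch (not the code from this PR) of the batched generate/decode pattern being described, assuming a seq2seq model such as google/flan-t5-base; the prompts and generation settings are illustrative:

```python
# Sketch only: tokenize a list of prompts once, run one batched generate() call,
# then decode all outputs together, instead of one pipeline call per prompt.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompts = [
    "Translate to German: How are you?",
    "Summarize: LangChain wraps many LLM providers behind one interface.",
]

# Padding lets all prompts share a single forward pass on the GPU.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=64)
texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for prompt, text in zip(prompts, texts):
    print(prompt, "->", text)
```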

seanaedmiston avatar Apr 09 '23 00:04 seanaedmiston

Unfortunately, resolving the linting issues is beyond me. I am not even sure that the approach I have used (overriding _call in the BaseLLM) is allowed. However, it "works for me", and the GPU utilization improvement is significant enough for large batch jobs that I thought it might be interesting to others. If anyone can help bash this into shape, that would be appreciated.

seanaedmiston avatar Apr 09 '23 00:04 seanaedmiston

@seanaedmiston Tangentially related question/request: could you possibly post a repo with a toy example of using a local LLM, so the less sophisticated of us could take advantage of this? The LangChain docs are pretty light on details regarding self-hosted LLMs, and having a simple working example would go a long way toward helping those of us for whom Python is not our primary language.

tensiondriven avatar Apr 17 '23 04:04 tensiondriven

@tensiondriven Just saw your comment. I think the docs have been improved in this regard, but it is actually deceptively simple: HF Transformers accepts a relative path as a model name. For example, if you load a model called google/flan-t5-base, it will be fetched from the HF Hub. But if you load a model called 'my_dir/to/my_model' and the model files are in that directory, then it will load from there.
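
As a toy sketch of that (the directory path is a placeholder, and a seq2seq model plus the text2text-generation task are assumed), the same path can be passed to from_pretrained and then wrapped with the existing HuggingFacePipeline class:

```python
# Sketch only: load a model from a local directory and wrap it for LangChain.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

# A Hub id like "google/flan-t5-base" downloads the model; a local path like
# "my_dir/to/my_model" (placeholder) loads the files already on disk.
model_path = "my_dir/to/my_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm("What is the capital of France?"))
```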

seanaedmiston avatar Apr 23 '23 23:04 seanaedmiston

Wow, delightful. As a human being, I appreciate it!

tensiondriven avatar Apr 23 '23 23:04 tensiondriven

Apologies for my ignorance, but what's the difference between this and HuggingFacePipeline? Is it that it uses self.model.generate on the underlying model to batch-generate?

dev2049 avatar May 18 '23 00:05 dev2049

So yes, this calls 'generate' and then 'decode' separately rather than letting the pipeline do it. This is handy if you want to try different decoding strategies.
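
For example (an illustrative sketch, not code from this PR), calling generate() yourself lets you change the decoding strategy per call:

```python
# Sketch only: same model and prompt, three decoding strategies.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
inputs = tokenizer("Summarize: LangChain chains LLM calls together.", return_tensors="pt")

greedy_ids = model.generate(**inputs, max_new_tokens=64)                    # greedy (default)
beam_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)         # beam search
sampled_ids = model.generate(**inputs, max_new_tokens=64,
                             do_sample=True, top_p=0.9)                     # nucleus sampling

print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```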

The other issue is performance. If you are using a local model and have a bunch of different prompts, it is significantly faster to pass them to HF Transformers as a list rather than one by one (i.e. in my proposed HuggingFace_Model, the 'prompts' input is a list of strings rather than a single string as in the HF pipeline). This particular improvement could be added to the HuggingFacePipeline class pretty easily, since pretty much all of the HF Transformers methods accept 'str or List[str]'.
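
A rough sketch of that suggestion (model name and batch size are illustrative), relying on the underlying transformers pipeline already accepting a list of prompts:

```python
# Sketch only: one pipeline call per prompt vs. one call for the whole list.
from transformers import pipeline

pipe = pipeline("text2text-generation", model="google/flan-t5-base")
prompts = ["Translate to German: Hello", "Translate to German: Good night"]

# One call per prompt (roughly what happens when prompts are fed one by one):
one_by_one = [pipe(p) for p in prompts]

# One call for the whole list, letting the pipeline batch work on the GPU:
batched = pipe(prompts, batch_size=8)
```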

seanaedmiston avatar May 22 '23 06:05 seanaedmiston

@seanaedmiston Hi, could you please resolve the merge conflicts and address the last comments (if needed)? After that, ping me and I'll push this PR for review. Thanks!

If this PR is not needed anymore, could you please let me know?

leo-gan avatar Sep 13 '23 00:09 leo-gan

I do not have the capacity to tidy this up at the moment. I still think the idea is useful, but for now we have just moved the functionality out of the langchain library, so this PR isn't really needed right now.

seanaedmiston avatar Sep 13 '23 03:09 seanaedmiston

Closing.

leo-gan avatar Sep 13 '23 15:09 leo-gan