
multi-batch support

Open NaamaVian opened this issue 2 years ago • 2 comments

Hi,

Two questions please:

  1. Do you support multi-prompt batching in any way? I tried passing batched input_ids, but generation failed with "Unsupport multi-batch input-ids": https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/__init__.py#L137. Is there another way? (A minimal sketch of the failing call follows this list.)

  2. Do you plan on integrating with HuggingFace's text-generation-inference?
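
For reference, a minimal sketch of the kind of call that triggers the error, assuming the HF-style `AutoModelForCausalLM` wrapper shown in this repo's README; the model name and generation kwargs are illustrative:

```python
# Sketch of a batched prompt hitting the "Unsupport multi-batch input-ids"
# check in the llm runtime graph wrapper. Model choice is illustrative.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt-j ships without a pad token

prompts = ["Once upon a time,", "The capital of France is"]
# Two prompts -> input_ids of shape (2, seq_len), i.e. batch size > 1
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
# At the time of this issue, the runtime rejected any batch dimension > 1 here.
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
```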

Thanks

NaamaVian avatar Nov 15 '23 17:11 NaamaVian

Hi, @NaamaKadosh,

About 1: we have added support for static batching (with left padding) in beam search, but it currently covers only some model architectures (e.g., gpt-j and gpt-neox). We plan to support other architectures, as well as batched top-k, top-p, and greedy-search generation, later (by the end of the year), and will let you know once that is done. A sketch of the currently supported path is below.
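
A hedged sketch of that static-batching path, assuming the HF-style wrapper and that `generate` mirrors Hugging Face's beam-search kwargs (`num_beams`); exact arguments may differ across versions:

```python
# Sketch: left-padded static batch + beam search on a supported
# architecture (gpt-j). num_beams is assumed to follow the HF generate API.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

prompts = ["Once upon a time,", "The capital of France is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)  # left-padded

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs.input_ids, num_beams=4, max_new_tokens=32)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```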

About 2: we will discuss it. Thanks.

zhentaoyu avatar Nov 16 '23 03:11 zhentaoyu

We have integrated TGI into NeuralChat: https://github.com/intel/intel-extension-for-transformers/pull/1180/files, but there is currently no way to combine the runtime with TGI.

kevinintel avatar Jan 24 '24 12:01 kevinintel

Hi, @NaamaKadosh, we support continuous batching when use_neural_speed is enabled; please refer to https://github.com/intel/neural-speed/blob/main/docs/continuous_batching.md for details and usage. We will add a related ITREX example soon (a sketch is below). If you have no other questions, we will close this issue. Thanks.
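
Pending that official example, a hedged sketch of batched generation through Neural Speed: the `use_neural_speed` flag follows the name mentioned above, the model choice is illustrative, and the linked continuous_batching.md is the authoritative reference for kwargs:

```python
# Sketch: several prompts of different lengths submitted as one batch.
# With Neural Speed, finished sequences free their slots for queued
# requests (continuous batching). Model choice is illustrative.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Once upon a time,",
    "The capital of France is",
    "In a galaxy far, far away",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_neural_speed=True
)
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```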

zhentaoyu avatar Jun 05 '24 07:06 zhentaoyu