multi-batch support
Hi,
Two questions, please:

- Do you support multi-prompt batching in any way? I tried passing batched input_ids, but generation failed with "Unsupport multi-batch input-ids": https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/init.py#L137 (a sketch of what I tried is below). Is there another way?
- Do you plan on integrating with Hugging Face's text-generation-inference?
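For reference, this is a minimal sketch of roughly what I tried, assuming the ITREX transformers-style API; the model name, prompts, and generation settings are placeholders:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"  # placeholder model
prompts = ["Once upon a time", "The capital of France is"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

# Tokenizing several prompts at once yields input_ids of shape (2, seq_len);
# this multi-batch tensor is what triggered "Unsupport multi-batch input-ids".
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
```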
Thanks
Hi @NaamaKadosh,
Regarding 1: we support static batching (with left padding) for beam search, but so far only for some model architectures (e.g. GPT-J and GPT-NeoX). We plan to support other model architectures, as well as batched top-k, top-p, and greedy decoding, later (by the end of the year), and will let you know when that is done.
Regarding 2: we will discuss it, thanks.
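For illustration, here is a rough sketch of left-padded static batching with beam search, assuming the ITREX transformers-style API; the model name, prompts, and generation settings below are placeholders, not a definitive recipe:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"  # one of the supported architectures
prompts = ["Once upon a time", "She opened the door and"]

# Left padding keeps the prompts right-aligned, so generation starts from
# the same position in every row of the batch.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    num_beams=4,        # beam search is the batched decoding mode supported so far
    max_new_tokens=32,
)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```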
We have integrated TGI into NeuralChat (https://github.com/intel/intel-extension-for-transformers/pull/1180/files), but there is currently no way to combine the runtime with TGI.
Hi @NaamaKadosh, we now support a continuous batching mechanism when use_neural_speed is enabled; please refer to https://github.com/intel/neural-speed/blob/main/docs/continuous_batching.md for more details and usage. We will add a related ITREX example soon. If you have no other questions, we will close this issue. Thanks.
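As a pointer, here is a minimal sketch of routing generation through Neural Speed via the use_neural_speed flag; the model name and prompts are placeholders, and the linked continuous_batching doc is the authoritative reference for the batching behavior:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
prompts = ["What is deep learning?", "Write a haiku about spring."]

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# Route execution through Neural Speed, which handles batching internally.
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_neural_speed=True
)

inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```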