
TypeError when using TransformersEmbeddings

Status: Open · Kailuo-Lai opened this issue · 3 comments

When I try to use BigDL's TransformersEmbeddings, I run into a strange problem.

Code:

self.embeddings = TransformersEmbeddings.from_model_id(model_id=f"../checkpoints/{self.embed_version}")
# Tokenizer options that should be applied when embedding documents
self.embeddings.encode_kwargs = {"truncation": True, "max_length": 512, "padding": True}
self.vectorstore_en = FAISS.from_texts(en_texts, self.embeddings, metadatas=[{"video_clip": str(i)} for i in range(len(en_texts))])
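
For context, these encode_kwargs are standard HuggingFace tokenizer arguments; outside of bigdl they work like this (placeholder model id, not from my setup):

  from transformers import AutoTokenizer

  # Placeholder checkpoint; any HF model with a tokenizer behaves the same way.
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  input_ids = tokenizer.encode(
      "some very long text ...",
      truncation=True,   # drop tokens beyond max_length
      max_length=512,
      padding=True,      # pad to the longest sequence in the batch (no-op for one text)
      return_tensors="pt",
  )
  print(input_ids.shape)  # at most torch.Size([1, 512])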

Output:

Traceback (most recent call last):
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "/data/home/chengruilai/projects/VChat-BigDL/main_gradio.py", line 82, in log_fn
    global_en_log_result = vchat.video2log(vid_path)
  File "/data/home/chengruilai/projects/VChat-BigDL/models/vchat_bigdl.py", line 64, in video2log
    self.llm_reasoner.create_qa_chain(en_log_result)
  File "/data/home/chengruilai/projects/VChat-BigDL/models/llm_model.py", line 101, in create_qa_chain
    self.vectorstore_en = FAISS.from_texts(en_texts, self.embeddings, metadatas=[{"video_clip": str(i)} for i in range(len(en_texts))])
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/langchain/vectorstores/faiss.py", line 577, in from_texts
    embeddings = embedding.embed_documents(texts)
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/bigdl/llm/langchain/embeddings/transformersembeddings.py", line 163, in embed_documents
    embeddings = [self.embed(text, **self.encode_kwargs).tolist() for text in texts]
  File "/data/home/chengruilai/anaconda3/envs/vchat/lib/python3.9/site-packages/bigdl/llm/langchain/embeddings/transformersembeddings.py", line 163, in <listcomp>
    embeddings = [self.embed(text, **self.encode_kwargs).tolist() for text in texts]
TypeError: embed() got an unexpected keyword argument 'truncation'

Environment: bigdl-llm 2.4.0

I think I found the root of this problem. When I change the embed function from

  def embed(self, text: str):
      """Compute the embedding of a single text using a HuggingFace transformer model.

      Args:
          text: The text to embed.

      Returns:
          The embedding of the text as a 1-D array.
      """
      input_ids = self.tokenizer.encode(text, return_tensors="pt")  # shape: [1, T]
      embeddings = self.model(input_ids, return_dict=False)[0]  # shape: [1, T, N]
      embeddings = embeddings.squeeze(0).detach().numpy()
      embeddings = np.mean(embeddings, axis=0)  # mean-pool over tokens
      return embeddings

to

  def embed(self, text: str, **kwargs):
      """Compute the embedding of a single text using a HuggingFace transformer model.

      Args:
          text: The text to embed.

      Returns:
          The embedding of the text as a 1-D array.
      """
      # Forward encode_kwargs (e.g. truncation, max_length, padding) to the tokenizer
      input_ids = self.tokenizer.encode(text, return_tensors="pt", **kwargs)  # shape: [1, T]
      embeddings = self.model(input_ids, return_dict=False)[0]  # shape: [1, T, N]
      embeddings = embeddings.squeeze(0).detach().numpy()
      embeddings = np.mean(embeddings, axis=0)  # mean-pool over tokens
      return embeddings

the problem is solved. It seems that "self.encode_kwargs" in https://github.com/intel-analytics/BigDL/blob/d09698d1a4c76460b95d3f7c3ffda731907f9c4b/python/llm/src/bigdl/llm/langchain/embeddings/transformersembeddings.py#L176 is unpacked into the "embed" call, but the released "embed" does not accept any keyword arguments, so the call raises the TypeError above. Can someone fix this bug?
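
To make the mismatch concrete, here is a minimal standalone sketch of the failure mode (simplified stand-ins, not the actual library code):

  # Stand-in for the released embed(), which accepts no keyword arguments
  def embed(text):
      return text

  encode_kwargs = {"truncation": True, "max_length": 512, "padding": True}

  # Mirrors the call in embed_documents(): self.embed(text, **self.encode_kwargs)
  embed("some text", **encode_kwargs)
  # TypeError: embed() got an unexpected keyword argument 'truncation'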

Kailuo-Lai · Jan 30, 2024

Thanks for reporting. We'll look into it.

shane-huang · Jan 30, 2024

We are fixing this in PR: https://github.com/intel-analytics/BigDL/pull/10051

shane-huang · Feb 1, 2024

This bug is fixed; please use the nightly build from tomorrow onward. Thanks for reporting this issue @Kailuo-Lai!
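
For reference, the nightly build of bigdl-llm is typically installed with pip's pre-release flag (command per the BigDL-LLM README at the time; adjust the extras for your setup):

  pip install --pre --upgrade bigdl-llm[all]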

plusbang · Feb 5, 2024