Xiao
I use the code below to run one query.

```python
import torch
# from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core import Settings
from llama_index.core...
```
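Since the script is truncated above, this is roughly what the setup looks like (a minimal sketch only; the model name, embedding model, and data path here are placeholders, not my exact code):

```python
import torch
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Local HF model wrapped for llama-index; generation runs on the GPU.
Settings.llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",      # placeholder model
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",  # placeholder tokenizer
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"torch_dtype": torch.float16},
    device_map="auto",
)
# Local embedding model so retrieval does not call any remote API.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Build an index and run a single query; retrieval runs on the CPU.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
```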
My log:

```
Loading checkpoint shards: 0%| | 0/2 [00:00
```
I found that there are multiple LLM calls, and I don't understand why we need them. First, we generate multiple new queries; this is one LLM call. Then we have...
> You are using a step decompose query transform
>
> So it's taking the original query and decomposing it into multiple
>
> The other queries are because it's...
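For context, my multi-step setup is wired up roughly like this (a sketch based on the llama-index multi-step query engine docs; the data path, summary string, and question are placeholders):

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.indices.query.query_transform.base import StepDecomposeQueryTransform
from llama_index.core.query_engine import MultiStepQueryEngine

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())

# One LLM call decomposes the original question into a new sub-query at each step,
# one LLM call answers each sub-query over its retrieved context, and a final call
# synthesizes the combined answer -- which is why a single .query() shows several LLM calls.
step_decompose = StepDecomposeQueryTransform(llm=Settings.llm, verbose=True)
multi_step_engine = MultiStepQueryEngine(
    query_engine=index.as_query_engine(),
    query_transform=step_decompose,
    index_summary="Used to answer questions about the loaded documents",
    num_steps=3,  # cap on the number of decomposition steps
)
response = multi_step_engine.query("How many LLM calls does one query trigger?")
```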
And `from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler` does not work for me, even though I followed the doc: https://docs.llamaindex.ai/en/stable/examples/callbacks/LlamaDebugHandler/
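This is the wiring I expected to work based on that page (a sketch; the last loop is just how I wanted to count and inspect the individual LLM calls):

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Register the debug handler globally (before building the index / query engine)
# so every query prints a trace of its retrieve and LLM events.
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([llama_debug])

# ... run a query, then inspect each LLM call's inputs and outputs:
for start_event, end_event in llama_debug.get_llm_inputs_outputs():
    print(end_event.payload.keys())
```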
I used nsys to profile llama-index. It seems the retrieval is on the CPU side and the LLM call is on the GPU side. Are there other things that are CPU...
> @lambda7xx when running a model locally like you are, there is no advantage to async, since it is all compute bound. Async only makes sense for
>
> a)...
Why is there the weird LLM call?
I think the text below is not related to my query.

```
Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides...
```
> It seems like your LLM just barfed while generating sub-queries (this "odd" query is a refine step, but the input to the refine step is part of the prompt...
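If someone else hits this, the baked-in example questions can be seen by dumping the query engine's prompt templates (a sketch; `multi_step_engine` is the engine from the sketch above):

```python
# Print every prompt template the query engine uses; the step-decompose and
# refine templates include few-shot examples such as the Grand Slam question,
# so that text shows up in the LLM inputs even though it is unrelated to my query.
prompts_dict = multi_step_engine.get_prompts()
for prompt_key, prompt in prompts_dict.items():
    print(prompt_key)
    print(prompt.get_template())
    print("-" * 40)
```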