Can I generate only multi-hop reference_contexts (without the question and reference)?
Your Question
I was wondering whether the API supports generating only the reference_contexts rather than running the whole flow. In addition, where can I find the actual links each context is based on?
Code Examples
Currently I am doing:
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import MultiHopSpecificQuerySynthesizer

# Snippet from inside a class: generator_llm, generator_embeddings and
# n_generations are instance attributes; docs is the list of loaded documents.
query_distribution = [(MultiHopSpecificQuerySynthesizer(llm=self.generator_llm), 1)]
generator = TestsetGenerator(llm=self.generator_llm, embedding_model=self.generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=self.n_generations, query_distribution=query_distribution)
chunks = []
for sample in dataset.samples:
    # Check if the sample has any reference contexts.
    if sample.eval_sample.reference_contexts:
        for context in sample.eval_sample.reference_contexts:
            # Create a simple chunk object with page_content and metadata.
            chunk = {
                "page_content": context,
                "metadata": {"Links": "TODO"},
            }
            chunks.append(chunk)
Thanks in advance.
Hi @GraderYuval,
The API doesn't support generating only the reference_contexts. Each testset sample is created by the _generate_sample function, which returns the query, the reference answer, and the reference contexts together:
return SingleTurnSample(
    user_input=response.query,
    reference=response.answer,
    reference_contexts=reference_context,
)
Could you clarify what you mean by "the actual links the context is based on"?
For context, the current testset generation flow works like this: the knowledge graph (KG) is constructed, clusters are identified from the KG, scenarios are generated based on those clusters, and then the testset sample is created.
Hi, thank you very much for the reply. Regarding the "actual links...", I meant that if I use URLs to obtain the context, I want to know which specific URL each context hop came from. For example, if we are crawling a webpage and creating a multi-hop context from it, can I retrieve from the dataloader ('docs') the actual metadata of the URLs from which the context was extracted?
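One possible workaround, sketched here under several assumptions rather than as a built-in feature: match each reference context back to the loaded documents by text and read the URL from that document's metadata. The sketch assumes the URL is stored under a metadata key such as "source" (common for LangChain web loaders, but worth checking on your own docs), that each context is a verbatim excerpt of a document, and that multi-hop contexts may start with a marker like `<1-hop>`. `Doc`, `normalize`, and `find_sources` are hypothetical names used only for illustration; `Doc` stands in for a LangChain document.

```python
import re
from dataclasses import dataclass, field


# Hypothetical stand-in for a LangChain Document (page_content + metadata).
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumption: multi-hop contexts may begin with a marker such as "<1-hop>".
HOP_PREFIX = re.compile(r"^\s*<\d+-hop>\s*")


def normalize(text: str) -> str:
    """Collapse whitespace so minor formatting differences do not block a match."""
    return " ".join(text.split())


def find_sources(context: str, docs, key: str = "source") -> list:
    """Return metadata[key] for every doc whose text contains the context."""
    needle = normalize(HOP_PREFIX.sub("", context))
    if not needle:
        return []
    return [d.metadata.get(key) for d in docs if needle in normalize(d.page_content)]


# Toy data in place of crawled pages and generated contexts.
docs = [
    Doc("Paris is the capital of France. It lies on the Seine.",
        {"source": "https://example.com/paris"}),
    Doc("Berlin is the capital of Germany.",
        {"source": "https://example.com/berlin"}),
]
contexts = [
    "<1-hop>\n\nParis is the capital of France.",
    "<2-hop>\n\nBerlin is the capital of Germany.",
]

# Same chunk shape as in the question, with "Links" filled in per context.
chunks = [
    {"page_content": c, "metadata": {"Links": find_sources(c, docs)}}
    for c in contexts
]
```

In the original loop, the "TODO" value could then be replaced with `find_sources(context, docs)`. A substring match will miss contexts that were rewritten or summarized during generation, so an empty list should be treated as "unknown" rather than as an error.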