How to save each piece of generated data during a single testset generation run?
Hi,
I am wondering how to save each piece of generated data during a single testset generation run. For example, if I use generator.generate_with_langchain_docs with test_size=1000, it keeps running until all 1000 questions are generated and then outputs a testset. But if an error occurs along the way (e.g. an OpenAI error: openai.InternalServerError: Error code: 500 - {'error': {'message': 'unexpected end of JSON input...'), the process shuts down, the questions generated before the error are lost, and I have to start another run to generate all 1000 questions from scratch.
Example code:
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=1000,
    distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.25, conditional: 0.25},
    is_async=True,
    raise_exceptions=True,
    run_config=RunConfig(max_retries=100, max_wait=120),
)
But if I instead set test_size=1 and call generator.generate_with_langchain_docs 1000 times, I think the process will compute the embeddings of all my documents 1000 times, which is costly.
Example code:
testset_list = []
for distributions in [{simple: 1}, {reasoning: 1}, {multi_context: 1}, {conditional: 1}]:
    for j in range(250):
        testset = generator.generate_with_langchain_docs(
            documents,
            test_size=1,
            distributions=distributions,
            is_async=True,
            raise_exceptions=True,
            run_config=RunConfig(max_retries=100, max_wait=120),
        )
        testset_list.append(testset)
Can you suggest a solution? Thank you.
Best wishes,
Hey @Dongchenghang, I'm here to help you out with any bugs, questions, or contributions you may have. Let's squash those bugs together! 😊
To save each generated testset entry during a single generation run with generator.generate_with_langchain_docs and avoid losing data when an error occurs, you can modify the generate_with_langchain_docs method to save intermediate results. This way, you won't need to recalculate the embeddings 1000 times. Here is an example of how you can achieve this:
import pickle
import typing as t

from langchain_core.documents import Document as LCDocument

from ragas.run_config import RunConfig
from ragas.testset.docstore import Document
from ragas.testset.generator import Distributions  # type alias in ragas 0.1.x


# Patched version of TestsetGenerator.generate_with_langchain_docs
# (import paths follow the ragas 0.1.x module layout).
def generate_with_langchain_docs(
    self,
    documents: t.Sequence[LCDocument],
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
    save_path: str = "intermediate_results.pkl",  # Add a parameter for the save path
):
    distributions = distributions or {}
    # Chunk the documents and add them to the docstore
    # (this is where the embeddings are computed).
    self.docstore.add_documents(
        [Document.from_langchain_document(doc) for doc in documents]
    )
    try:
        test_dataset = self.generate(
            test_size=test_size,
            distributions=distributions,
            with_debugging_logs=with_debugging_logs,
            is_async=is_async,
            raise_exceptions=raise_exceptions,
            run_config=run_config,
        )
        # Save the generated test dataset to a file
        with open(save_path, "wb") as f:
            pickle.dump(test_dataset, f)
    except Exception:
        # If an error occurs, save the docstore (with its embeddings)
        # as the intermediate result
        with open(save_path, "wb") as f:
            pickle.dump(self.docstore, f)
        raise
    return test_dataset
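If you don't want to fork ragas, you can monkey-patch this function onto the generator class. This is a hypothetical sketch: it assumes the ragas 0.1.x TestsetGenerator, reuses the generator and documents objects from your snippets, and the save_path filename is an arbitrary choice:
from ragas.testset.generator import TestsetGenerator

# Replace the original method with the patched version defined above
TestsetGenerator.generate_with_langchain_docs = generate_with_langchain_docs

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=1000,
    distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.25, conditional: 0.25},
    save_path="run1_results.pkl",  # hypothetical checkpoint file
)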
In this modified method, the save_path parameter specifies where to save the intermediate results. If an error occurs during generation, the current state of the docstore is saved to that file, so you can resume the process without recalculating the embeddings [1].
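Alternatively, if you prefer not to patch ragas at all, a lighter-weight variant of the same idea is to embed the documents once through the generator's docstore and then call generate in small batches, checkpointing each batch to disk. The following is a minimal, untested sketch under the same ragas 0.1.x assumptions (TestsetGenerator exposes a docstore attribute and a generate method); batch_size and the checkpoint filenames are arbitrary choices, and generator, documents, simple, reasoning, multi_context, and conditional come from your snippets:
import pickle

from ragas.run_config import RunConfig
from ragas.testset.docstore import Document

# Chunk and embed the documents once, outside the loop
generator.docstore.add_documents(
    [Document.from_langchain_document(doc) for doc in documents]
)

batch_size = 50  # 5 batches of 50 per distribution instead of one run of 1000
testset_list = []
for distributions in [{simple: 1}, {reasoning: 1}, {multi_context: 1}, {conditional: 1}]:
    for _ in range(250 // batch_size):
        try:
            testset = generator.generate(
                test_size=batch_size,
                distributions=distributions,
                is_async=True,
                raise_exceptions=True,
                run_config=RunConfig(max_retries=100, max_wait=120),
            )
        except Exception:
            # A failed batch loses at most batch_size questions;
            # every earlier batch is already checkpointed on disk
            continue
        testset_list.append(testset)
        # Checkpoint after every successful batch
        with open(f"testset_batch_{len(testset_list)}.pkl", "wb") as f:
            pickle.dump(testset, f)
Because every batch reuses the same docstore, the document embeddings are computed only once; after the run you can reload the checkpoints with pickle.load and, for example, concatenate their to_pandas() frames into a single testset.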