
[R-288] Limit Evolution

ArjunBakhale opened this issue 1 year ago · 2 comments

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Is there a way to add an argument to testset generation that sets the maximum number of evolutions to something like 1 instead of 5?

i.e.:

run_config = RunConfig(
    timeout=360, max_retries=2, max_wait=60, max_workers=32,
    log_tenacity=True, max_evolutions=2,  # <-- proposed new argument
)

current_testset = generator.generate_with_langchain_docs(
    document,
    test_size=test_size_gen,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    run_config=run_config,
)

Thank you :)


— ArjunBakhale, Jul 31 '24

Hey @ArjunBakhale! 👋 I'm here to help you with your question. I'm a bot designed to assist with bug fixes, answer questions, and guide you on becoming a contributor.

Yes, you can add such an argument. You can modify the generate method in the TestsetGenerator class to accept a max_evolutions parameter and use it to cap how many evolutions are submitted to the executor.

Here is the modified generate method with the max_evolutions parameter:

def generate(
    self,
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
    max_evolutions: int = 5,  # Add max_evolutions parameter with default value 5
):
    distributions = distributions or DEFAULT_DISTRIBUTION
    # validate distributions
    if not check_if_sum_is_close(list(distributions.values()), 1.0, 3):
        raise ValueError(
            f"distributions passed do not sum to 1.0 [got {sum(list(distributions.values()))}]. Please check the "
            f"distributions."
        )

    # configure run_config for docstore
    if run_config is None:
        run_config = RunConfig(max_retries=15, max_wait=90)
    self.docstore.set_run_config(run_config)

    # init filters and evolutions
    for evolution in distributions:
        self.init_evolution(evolution)
        evolution.init(is_async=is_async, run_config=run_config)

    if with_debugging_logs:
        from ragas.utils import patch_logger

        patch_logger("ragas.testset.evolutions", logging.DEBUG)
        patch_logger("ragas.testset.extractor", logging.DEBUG)
        patch_logger("ragas.testset.filters", logging.DEBUG)
        patch_logger("ragas.testset.docstore", logging.DEBUG)
        patch_logger("ragas.llms.prompt", logging.DEBUG)

    exec = Executor(
        desc="Generating",
        keep_progress_bar=True,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )

    current_nodes = [
        CurrentNodes(root_node=n, nodes=[n])
        for n in self.docstore.get_random_nodes(k=test_size)
    ]
    total_evolutions = 0
    for evolution, probability in distributions.items():
        # stop scheduling new work once the evolution cap is reached
        if total_evolutions >= max_evolutions:
            break
        for i in sample(range(test_size), round(probability * test_size)):
            if total_evolutions >= max_evolutions:
                break
            exec.submit(
                evolution.evolve,
                current_nodes[i],
                name=f"{evolution.__class__.__name__}-{i}",
            )
            total_evolutions += 1
    # top up to test_size with randomly chosen evolutions, still
    # respecting the max_evolutions cap
    if total_evolutions < min(test_size, max_evolutions):
        filler_evolutions = choices(
            list(distributions), k=test_size - total_evolutions
        )
        for evolution in filler_evolutions:
            if total_evolutions >= max_evolutions:
                break
            exec.submit(
                evolution.evolve,
                current_nodes[total_evolutions],
                name=f"{evolution.__class__.__name__}-{total_evolutions}",
            )
            total_evolutions += 1

    try:
        test_data_rows = exec.results()
        if not test_data_rows:
            raise ExceptionInRunner()

    except ValueError as e:
        raise e
    # make sure to ignore any NaNs that might have been returned
    # due to failed evolutions. MaxRetriesExceeded is a common reason
    test_data_rows = [r for r in test_data_rows if not is_nan(r)]
    test_dataset = TestDataset(test_data=test_data_rows)
    evol_lang = [get_feature_language(e) for e in distributions]
    evol_lang = [e for e in evol_lang if e is not None]
    track(
        TestsetGenerationEvent(
            event_type="testset_generation",
            evolution_names=[e.__class__.__name__.lower() for e in distributions],
            evolution_percentages=[distributions[e] for e in distributions],
            num_rows=len(test_dataset.test_data),
            language=evol_lang[0] if len(evol_lang) > 0 else "",
        )
    )

    return test_dataset

With this change, you can now call the generate method with the max_evolutions parameter to set the maximum number of evolutions. For example:

testset_generator.generate(test_size=100, max_evolutions=1)

The same approach applies to the generate_with_langchain_docs method: have it accept max_evolutions and pass it through to generate [1].
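For illustration, a minimal sketch of that forwarding — an assumption, not the actual implementation: it assumes generate_with_langchain_docs loads the documents into the docstore and then delegates to generate, and elides the loading step:

def generate_with_langchain_docs(
    self,
    documents,
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
    max_evolutions: int = 5,  # hypothetical: forwarded to generate()
):
    # ... chunk `documents` and add them to self.docstore, as the
    # existing implementation does ...
    return self.generate(
        test_size=test_size,
        distributions=distributions,
        with_debugging_logs=with_debugging_logs,
        is_async=is_async,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
        max_evolutions=max_evolutions,
    )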

To continue talking to Dosu, mention @dosu.

— dosubot[bot], Jul 31 '24

@ArjunBakhale that is a good suggestion, but I'm not sure if we should do it like that (still thinking about it). Could you tell me the reasoning behind the suggestion, so I have some more context?

But alternatively, this is what you should be able to do today:

# cap max_tries for each evolution
simple.max_tries = 1
reasoning.max_tries = 1
multi_context.max_tries = 1

current_testset = generator.generate_with_langchain_docs(
    document,
    test_size=test_size_gen,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    run_config=run_config,
)
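With several evolutions, the same cap can be applied in one pass — a minimal sketch, assuming (as in the snippet above) that each evolution object exposes a mutable max_tries attribute:

distributions = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}
# apply the same cap to every evolution in the distribution
for evolution in distributions:
    evolution.max_tries = 1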

What do you think of that? Alternatively, I'm thinking we should rename max_tries to something like max_evolutions to make it easier to understand.

— jjmachan, Aug 02 '24