Generate a Synthetic Test Set in Another Language

Open AprilJoy opened this issue 1 year ago • 3 comments

I checked the documentation and related resources and couldn't find an answer to my question.

Your Question: When I use the function generate_with_langchain_docs to generate a test set, the question and ground_truth columns are generated in English. How can I have them generated in another language?

AprilJoy avatar Jun 16 '24 02:06 AprilJoy

Hey @AprilJoy! 👋 I'm here to assist you with any bugs, questions, or contributions you have. I'm Dosu, a bot designed to help out while you wait for a human maintainer.

To generate the "question" and "ground_truth" columns in a different language using the generate_with_langchain_docs function, you need to adapt the language settings of the generator_llm and critic_llm used in the TestsetGenerator. Here's how you can modify the function to include a language parameter:

def generate_with_langchain_docs(
    self,
    documents: t.Sequence[LCDocument],
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
    language: str = "en"  # Add a language parameter with a default value
):
    distributions = distributions or {}
    # chunk documents and add to docstore
    self.docstore.add_documents(
        [Document.from_langchain_document(doc) for doc in documents]
    )

    # Adapt the language for the generator and critic LLMs
    self.generator_llm.adapt(language)
    self.critic_llm.adapt(language)

    return self.generate(
        test_size=test_size,
        distributions=distributions,
        with_debugging_logs=with_debugging_logs,
        is_async=is_async,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )

In this modification, the language parameter is added to the generate_with_langchain_docs function. The adapt method is called on both generator_llm and critic_llm to set the desired language before generating the test set. This ensures that the "question" and "ground_truth" columns are generated in the specified language [1].
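
The same idea can be tried without editing the library source. The sketch below is a hypothetical wrapper (generate_in_language is not part of ragas) that calls adapt(language) on generator_llm and critic_llm and then delegates to generate_with_langchain_docs. Whether these objects actually expose an adapt method is an assumption taken from the snippet above, so the wrapper raises an error rather than silently falling back to English. FakeLLM and FakeGenerator are stand-ins so the example runs without ragas installed.

```python
from typing import Any, List, Sequence


def generate_in_language(
    generator: Any,
    documents: Sequence[Any],
    test_size: int,
    language: str,
    **kwargs: Any,
) -> Any:
    """Call adapt(language) on both LLMs, then delegate to the generator.

    Hypothetical helper, not part of ragas.
    """
    for name in ("generator_llm", "critic_llm"):
        adapt = getattr(getattr(generator, name, None), "adapt", None)
        if not callable(adapt):
            # Fail loudly instead of silently producing English output.
            raise AttributeError(f"{name} does not expose adapt()")
        adapt(language)
    return generator.generate_with_langchain_docs(
        documents, test_size=test_size, **kwargs
    )


# Stand-ins for illustration only (assumptions, not the real ragas classes).
class FakeLLM:
    def __init__(self) -> None:
        self.language = "en"

    def adapt(self, language: str) -> None:
        self.language = language


class FakeGenerator:
    def __init__(self) -> None:
        self.generator_llm = FakeLLM()
        self.critic_llm = FakeLLM()

    def generate_with_langchain_docs(
        self, documents: Sequence[Any], test_size: int, **kwargs: Any
    ) -> List[str]:
        # Tag each fake question with the language the generator LLM was adapted to.
        return [
            f"[{self.generator_llm.language}] question {i}" for i in range(test_size)
        ]


fake = FakeGenerator()
rows = generate_in_language(fake, documents=[], test_size=2, language="de")
print(rows)
```

If the AttributeError is raised against a real generator, the language has to be set some other way, for example through the prompts rather than the LLM wrappers.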

dosubot[bot] avatar Jun 16 '24 02:06 dosubot[bot]

Hey @AprilJoy, we understand this is important, but the current version of test generation has many other limitations, which is why we are working on #1016. An initial version of this will be merged by early next week, and then we will collect and work on issues like this one that need attention.

shahules786 avatar Jun 16 '24 06:06 shahules786

Keeping tabs on #1016.

jjmachan avatar Aug 02 '24 07:08 jjmachan

I suppose there is no real update on this one? I followed the tutorial https://docs.ragas.io/en/stable/howtos/customizations/testgenerator/_language_adaptation/ and the only change I made was to use German data instead of Spanish. However, most of the generated questions and answers are in English.

fhamborg avatar Feb 21 '25 11:02 fhamborg
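
To put a number on "most of the questions and answers are in English", a rough check like the one below may help when reporting results. It is a minimal sketch that uses only the standard library: it guesses the language of each question and ground_truth value by counting stopwords and returns the indices of rows that do not match the expected language. The stopword lists are tiny and illustrative, the helper names are hypothetical, and the sample rows are made up for the demonstration. A dedicated language-detection library would be more reliable on real data.

```python
import re
from typing import Dict, List, Sequence

# Tiny illustrative stopword lists; words shared by both languages are left out.
STOPWORDS = {
    "en": {"the", "is", "of", "and", "what", "which", "how", "does", "to", "are"},
    "de": {"der", "die", "das", "ist", "und", "wie", "welche", "nicht",
           "ein", "eine", "für", "von", "wird"},
}


def guess_language(text: str) -> str:
    """Return 'en', 'de', or 'unknown' based on stopword counts."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_lang, top), (_, second) = ranked[0], ranked[1]
    if top == 0 or top == second:
        return "unknown"  # no evidence, or a tie between languages
    return top_lang


def wrong_language_rows(
    rows: Sequence[Dict[str, str]],
    expected: str,
    columns: Sequence[str] = ("question", "ground_truth"),
) -> List[int]:
    """Indices of rows where any checked column is not in the expected language."""
    return [
        index
        for index, row in enumerate(rows)
        if any(guess_language(row[col]) != expected for col in columns)
    ]


# Made-up sample rows, only to demonstrate the check.
sample = [
    {"question": "Wie ist das Wetter und wie wird es morgen?",
     "ground_truth": "Das Wetter ist gut und die Sonne scheint."},
    {"question": "What is the capital of Germany?",
     "ground_truth": "The capital of Germany is Berlin."},
]
flagged = wrong_language_rows(sample, expected="de")
print(f"{len(flagged)} of {len(sample)} rows are not German: {flagged}")
```

With a generated test set converted to a list of dictionaries, the same call gives the share of rows that came out in the wrong language, which is a more useful data point for maintainers than an impression.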