
Origin of Persona-Based Synthetic Test Dataset Generation

Open xXFiEsTaDeAmOnXx opened this issue 9 months ago • 0 comments

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question

Does the idea of generating questions based on personas originate from the Evol-Instruct paper or from the "Scaling Synthetic Data Creation with 1,000,000,000 Personas" paper? Or is it a custom implementation? Are there scientific references I can cite for this concept in my work?

Code Examples

# Note: these import paths correspond to ragas' internal modules and may
# differ between versions.
import typing as t

from ragas.prompt import PydanticPrompt, StringIO
from ragas.testset.persona import Persona

class PersonaGenerationPrompt(PydanticPrompt[StringIO, Persona]):
    instruction: str = (
        "Using the provided summary, generate a single persona who would likely "
        "interact with or benefit from the content. Include a unique name and a "
        "concise role description of who they are."
    )
    input_model: t.Type[StringIO] = StringIO
    output_model: t.Type[Persona] = Persona
    examples: t.List[t.Tuple[StringIO, Persona]] = [
        (
            StringIO(
                text="Guide to Digital Marketing explains strategies for engaging audiences across various online platforms."
            ),
            Persona(
                name="Digital Marketing Specialist",
                role_description="Focuses on engaging audiences and growing the brand online.",
            ),
        )
    ]
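For intuition, a `PydanticPrompt` like the one above is ultimately rendered into a few-shot prompt string: the instruction, the serialized `(input, output)` example pairs, and the new input. The sketch below illustrates that general pattern with plain dataclasses; it is not ragas' actual rendering code, and the `render_few_shot_prompt` helper and its output layout are illustrative assumptions.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class StringIO:
    text: str


@dataclass
class Persona:
    name: str
    role_description: str


def render_few_shot_prompt(instruction, examples, new_input):
    """Illustrative only: join the instruction, JSON-serialized example
    pairs, and the new input into a single few-shot prompt string."""
    parts = [instruction, ""]
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Input: {json.dumps(asdict(inp))}")
        parts.append(f"Output: {json.dumps(asdict(out))}")
    parts.append(f"Input: {json.dumps(asdict(new_input))}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = render_few_shot_prompt(
    "Using the provided summary, generate a single persona.",
    [
        (
            StringIO(text="Guide to Digital Marketing."),
            Persona(
                name="Digital Marketing Specialist",
                role_description="Focuses on engaging audiences online.",
            ),
        )
    ],
    StringIO(text="Intro to Kubernetes covers container orchestration."),
)
print(prompt)
```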

class QueryAnswerGenerationPrompt(PydanticPrompt[QueryCondition, GeneratedQueryAnswer]):
    instruction: str = (
        "Generate a single-hop query and answer based on the specified conditions (persona, term, style, length) "
        "and the provided context. Ensure the answer is entirely faithful to the context, using only the information "
        "directly from the provided context."
        "### Instructions:\n"
        "1. **Generate a Query**: Based on the context, persona, term, style, and length, create a question "
        "that aligns with the persona's perspective and incorporates the term.\n"
        "2. **Generate an Answer**: Using only the content from the provided context, construct a detailed answer "
        "to the query. Do not add any information not included in or inferable from the context.\n"
    )
    input_model: t.Type[QueryCondition] = QueryCondition
    output_model: t.Type[GeneratedQueryAnswer] = GeneratedQueryAnswer
    examples: t.List[t.Tuple[QueryCondition, GeneratedQueryAnswer]] = [
        (
            QueryCondition(
                persona=Persona(
                    name="Software Engineer",
                    role_description="Focuses on coding best practices and system design.",
                ),
                term="microservices",
                query_style="Formal",
                query_length="Medium",
                context="Microservices are an architectural style where applications are structured as a collection of loosely coupled services. "
                "Each service is fine-grained and focuses on a single functionality.",
            ),
            GeneratedQueryAnswer(
                query="What is the purpose of microservices in software architecture?",
                answer="Microservices are designed to structure applications as a collection of loosely coupled services, each focusing on a single functionality.",
            ),
        ),
    ]
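To show how the two prompts above compose into a pipeline, here is a minimal, library-free sketch: step 1 derives a persona from a document summary, step 2 combines that persona with a term, style, and length into a query condition. The LLM call is stubbed with a deterministic fake; the function names `generate_persona`, `build_condition`, and `fake_llm` are my own illustrative choices, not ragas API.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    name: str
    role_description: str


@dataclass
class QueryCondition:
    persona: Persona
    term: str
    query_style: str
    query_length: str
    context: str


def generate_persona(summary: str, llm) -> Persona:
    """Step 1: derive a persona from a document summary (LLM stubbed)."""
    return llm(f"Generate a persona for: {summary}")


def build_condition(persona: Persona, term: str, context: str) -> QueryCondition:
    """Step 2: combine the persona with a term and style/length constraints."""
    return QueryCondition(
        persona=persona,
        term=term,
        query_style="Formal",
        query_length="Medium",
        context=context,
    )


def fake_llm(prompt: str) -> Persona:
    """Deterministic stand-in for a real LLM, so the pipeline runs end to end."""
    return Persona(
        name="Software Engineer",
        role_description="Focuses on coding best practices and system design.",
    )


persona = generate_persona("An article about microservices.", fake_llm)
condition = build_condition(
    persona,
    term="microservices",
    context="Microservices are an architectural style with loosely coupled services.",
)
print(condition.persona.name, "->", condition.term)
```

In the real library the query/answer generation prompt would then be rendered from this condition; the sketch only shows the data flow between the two prompt stages.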

xXFiEsTaDeAmOnXx · Mar 28 '25 12:03