Translate Pydantic models into a regular expression that accepts the corresponding YAML

Open rlouf opened this issue 1 year ago • 2 comments

rlouf avatar May 27 '24 16:05 rlouf

I love the idea. YAML uses fewer syntactic tokens and lets language models generate without needing to keep track of as much "nesting" / context.

Here's what I'm thinking for a strategy, would love to hear your thoughts:

We should refactor fsm/json_schema.py so it uses a class-based approach with handler methods for each type. Then we can subclass to implement the different behavior in yaml.

class JSONSchemaRegexGenerator:
    def __init__(self):
        self.handlers = {
            "string": self.handle_string,
            "array": self.handle_array,
            ...
        }

    @classmethod
    def get_pattern(cls, schema):
        return cls().handle_node(schema)

    def handle_node(self, node):
        handler = self.handlers.get(node["type"], self.handle_default)
        return handler(node)

    def handle_string(self, node):
        return STRING

    def handle_array(self, node):
        ...
        return rf"\[{whitespace_pattern}(({'|'.join(regexes)})(,{whitespace_pattern}({'|'.join(regexes)})){num_repeats}){allow_empty}{whitespace_pattern}\]"


class YAMLSchemaRegexGenerator(JSONSchemaRegexGenerator):
    def handle_array(self, node):
        """handle format for yaml arrays:
            - elem0
            - elem1
        """
        ...     

This would make the code more readable and extensible, reduce technical debt, and remove the need to branch on a passed is_yaml flag for many rules within to_regex().
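To make the proposal concrete, here is a minimal runnable sketch of the handler-dispatch idea. It is not the actual contents of fsm/json_schema.py: the STRING and INTEGER patterns, handle_integer, handle_default, and the simplified array regexes are all assumptions made for illustration, and the YAML handler only covers a block list of scalars.

```python
import re

# Hypothetical, simplified primitives (assumptions, not the library's real patterns).
STRING = r'"[^"\\\n]*"'
INTEGER = r"-?(0|[1-9][0-9]*)"


class JSONSchemaRegexGenerator:
    def __init__(self):
        # Bound methods are looked up on self, so subclass overrides are picked up.
        self.handlers = {
            "string": self.handle_string,
            "integer": self.handle_integer,
            "array": self.handle_array,
        }

    @classmethod
    def get_pattern(cls, schema):
        return cls().handle_node(schema)

    def handle_node(self, node):
        handler = self.handlers.get(node["type"], self.handle_default)
        return handler(node)

    def handle_default(self, node):
        # Hypothetical fallback: fail loudly on types without a handler.
        raise NotImplementedError(f"unsupported type: {node['type']!r}")

    def handle_string(self, node):
        return STRING

    def handle_integer(self, node):
        return INTEGER

    def handle_array(self, node):
        # JSON flow style, e.g. [1, 2, 3]; the empty array is allowed.
        item = self.handle_node(node["items"])
        return rf"\[({item}(, ?{item})*)?\]"


class YAMLSchemaRegexGenerator(JSONSchemaRegexGenerator):
    def handle_array(self, node):
        # YAML block style: one "- item" line per element, at least one element.
        # Scalars only; nested arrays would need indentation handling.
        item = self.handle_node(node["items"])
        return rf"(- {item}\n)+"


schema = {"type": "array", "items": {"type": "integer"}}
json_pattern = JSONSchemaRegexGenerator.get_pattern(schema)
yaml_pattern = YAMLSchemaRegexGenerator.get_pattern(schema)
```

Only handle_array differs between the two classes; the scalar handlers are inherited, so no is_yaml flag is needed anywhere in this sketch.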

lapp0 avatar May 29 '24 07:05 lapp0

I can get on board with this. To follow ast.NodeVisitor's naming scheme we could name the handlers visit_X. I think we should implement a first version of the YAML converter with only a few primitives before refactoring.
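A rough sketch of what the visit_X naming could look like, using the same getattr-based dispatch as ast.NodeVisitor. The class names, the chosen primitives, and the flat "key: value" object format are all hypothetical; this only illustrates the naming scheme, not a planned implementation.

```python
import re


class SchemaVisitor:
    """Hypothetical base class that dispatches on node["type"] like ast.NodeVisitor."""

    def visit(self, node):
        # Look up visit_<type>; fall back to generic_visit when it is missing.
        method = getattr(self, "visit_" + node["type"], self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(f"no visitor for type {node['type']!r}")


class YAMLPrimitivesVisitor(SchemaVisitor):
    """First-version converter covering only a few primitives (assumed set)."""

    def visit_integer(self, node):
        return r"-?(0|[1-9][0-9]*)"

    def visit_boolean(self, node):
        return r"(true|false)"

    def visit_null(self, node):
        return r"null"

    def visit_object(self, node):
        # Flat mapping only: one "key: value" line per property, in schema order.
        lines = [
            rf"{re.escape(name)}: {self.visit(sub)}\n"
            for name, sub in node["properties"].items()
        ]
        return "".join(lines)


schema = {
    "type": "object",
    "properties": {"age": {"type": "integer"}, "active": {"type": "boolean"}},
}
pattern = YAMLPrimitivesVisitor().visit(schema)
```

With this layout, supporting a new type means adding one visit_<type> method, with no handler table to keep in sync.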

rlouf avatar Jun 05 '24 12:06 rlouf