Add `outlines.generate.fsm()` API entrypoint

Open rlouf opened this issue 1 year ago • 10 comments

As suggested in https://github.com/outlines-dev/outlines/issues/666#issuecomment-1947685486, we should allow users to pass any DFA (with vocabulary = characters) to generate text in the language accepted by that DFA.

Feb 16 '24 08:02 rlouf

We should allow passing an interegular.FSM, because the operations requested in the linked issue, and those requested by users in other issues, are not possible with a RegexFSM. For example, you cannot use subtraction with RegexFSM to disallow keywords, but you can with interegular.FSM.
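To illustrate why subtraction needs an FSM-level object rather than a regex string, here is a minimal pure-Python sketch of DFA difference via the product construction. The `DFA` class and the `any_nonempty` / `exactly` helpers are hypothetical stand-ins written for this example; they are not interegular's API, and the sketch assumes both automata are complete and share one alphabet.

```python
from itertools import product
from string import ascii_lowercase


class DFA:
    """Complete DFA over characters. Illustrative only, not interegular's API."""

    def __init__(self, alphabet, initial, finals, transitions):
        self.alphabet = frozenset(alphabet)
        self.initial = initial
        self.finals = frozenset(finals)
        self.transitions = transitions  # {(state, char): next_state}
        self.states = (
            {initial} | {s for s, _ in transitions} | set(transitions.values())
        )

    def accepts(self, text):
        state = self.initial
        for char in text:
            if char not in self.alphabet:
                return False
            state = self.transitions[(state, char)]
        return state in self.finals

    def __sub__(self, other):
        """Product construction: accept when self accepts and other does not."""
        if self.alphabet != other.alphabet:
            raise ValueError("both DFAs must share one alphabet")
        pairs = list(product(self.states, other.states))
        transitions = {
            ((a, b), char): (self.transitions[(a, char)], other.transitions[(b, char)])
            for a, b in pairs
            for char in self.alphabet
        }
        finals = {(a, b) for a, b in pairs if a in self.finals and b not in other.finals}
        return DFA(self.alphabet, (self.initial, other.initial), finals, transitions)


def any_nonempty(alphabet):
    """DFA accepting every non-empty string over the alphabet."""
    transitions = {(state, c): 1 for state in (0, 1) for c in alphabet}
    return DFA(alphabet, 0, {1}, transitions)


def exactly(word, alphabet):
    """DFA accepting only `word`; state i means the first i characters matched."""
    dead = len(word) + 1
    transitions = {}
    for state in range(dead + 1):
        for c in alphabet:
            matches = state < len(word) and c == word[state]
            transitions[(state, c)] = state + 1 if matches else dead
    return DFA(alphabet, 0, {len(word)}, transitions)


identifiers = any_nonempty(ascii_lowercase)
no_keywords = identifiers - exactly("if", ascii_lowercase) - exactly("else", ascii_lowercase)
```

Here `no_keywords` accepts any lowercase identifier except the two keywords, which is the kind of constraint the comment above describes.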

We should modify RegexFSM to have RegexFSM.from_interegular_fsm

https://github.com/outlines-dev/outlines/blob/main/outlines/fsm/fsm.py#L94-L125

Feb 16 '24 22:02 lapp0

> We should modify RegexFSM to have RegexFSM.from_interegular_fsm

This is a hint that we have a naming problem here. And we might as well fix it before more libraries depend on Outlines' internals.

Feb 17 '24 09:02 rlouf

Agreed 100%. Maybe TokenIndexFSM

Feb 17 '24 21:02 lapp0

Could someone please explain this to me? I'm confused.

Does the user have to create the FSM with interegular and pass it to a method outlines.generate.fsm(fsm=interegular_fsm) which returns the SequenceGenerator?

I guess RegexFSM.from_interegular_fsm should turn an interegular_fsm object into a RegexFSM object so that it can be processed by SequenceGenerator?

I'm interested in working on it if it's possible.

Feb 19 '24 23:02 miftahmoha

> Does the user have to create the FSM with interegular and pass it to a method outlines.generate.fsm(fsm=interegular_fsm) which returns the SequenceGenerator?

Yes. interegular allows you to perform helpful operations on FSMs - concatenation, OR / XOR, intersection, negation, etc. RegexFSM doesn't have these capabilities.

> I guess RegexFSM.from_interegular_fsm should turn an interegular_fsm object into a RegexFSM object so that it can be processed by SequenceGenerator?

Yes exactly!

> I'm interested in working on it if it's possible.

That's great news!

The current RegexFSM.__init__ logic involves conversion from pattern -> interegular.FSM -> initialization of Outlines RegexFSM. Ideally from_interegular_fsm wouldn't repeat logic with __init__, but instead they'd share as much logic as possible using a helper function.
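A rough sketch of the shared-helper layout suggested above, using only the standard library. `PatternIndex`, `from_compiled`, and `_build` are hypothetical names, and a compiled `re.Pattern` stands in for the intermediate `interegular.FSM`; this shows the structure only, not Outlines' actual code.

```python
import re
from typing import Dict, List, Pattern


class PatternIndex:
    """Hypothetical stand-in for RegexFSM, showing two constructors sharing one helper."""

    def __init__(self, pattern: str, vocabulary: Dict[str, int]):
        # Step 1: pattern string -> intermediate object (stand-in for interegular.FSM).
        compiled = re.compile(pattern)
        # Step 2: intermediate object -> token-level data, via the shared helper.
        self._build(compiled, vocabulary)

    @classmethod
    def from_compiled(cls, compiled: Pattern[str], vocabulary: Dict[str, int]):
        # Skip __init__ because there is no pattern string to parse,
        # then reuse the same helper so no logic is duplicated.
        instance = cls.__new__(cls)
        instance._build(compiled, vocabulary)
        return instance

    def _build(self, compiled: Pattern[str], vocabulary: Dict[str, int]) -> None:
        self.vocabulary = vocabulary
        self.allowed_token_ids: List[int] = sorted(
            token_id
            for token, token_id in vocabulary.items()
            if compiled.fullmatch(token)
        )


vocabulary = {"a": 0, "ab": 1, "1": 2}
from_pattern = PatternIndex("[a-z]+", vocabulary)
from_object = PatternIndex.from_compiled(re.compile("[a-z]+"), vocabulary)
```

Both construction paths end in `_build`, so the two instances carry the same state regardless of which entry point was used.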

I'd work off of https://github.com/outlines-dev/outlines/blob/main/outlines/generate/regex.py#L10-L36 for the second step, creating outlines.generate.fsm()

Please ping me if you have any questions.

Feb 20 '24 07:02 lapp0

What do you guys think? @lapp0 @rlouf

outlines/outlines/generate/__init__.py

```python
...
from .fsm import fsm
```

outlines/outlines/generate/fsm.py

```python
def fsm(model, fsm: FSM, sampler: Sampler = multinomial()) -> SequenceGenerator:
    fsm = RegexFSM.from_interegular_fsm(fsm, model.tokenizer)
    device = model.device
    generator = SequenceGenerator(fsm, model, sampler, device)
    return generator
```

outlines/outlines/fsm/fsm.py

```python
class RegexFSM:
    ...
    @classmethod
    def from_interegular_fsm(cls, interegular_fsm: FSM, tokenizer: "Tokenizer"):
        from_interegular_instance = cls.__new__(cls)

        @cache()
        def create_states_mapping_from_interegular_fsm(
            fsm: FSM, cacheable_vocabulary: Tuple[Tuple[str, int]]
        ) -> Tuple[dict, set]:
            """Create the variables related to the mapping between states and tokens
            The parameters of the function are used for caching purpose
            """
            regex_fsm, _ = make_deterministic_fsm(fsm.reduce())
            states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
                regex_fsm, tokenizer
            )

            # We make sure that it is possible to generate strings in the language
            # of the regular expression with the tokens present in the model's
            # vocabulary.
            if not any(
                regex_fsm.finals.intersection(v.values())
                for v in states_to_token_maps.values()
            ):
                raise ValueError(
                    "The vocabulary does not allow us to build a sequence that matches the input regex"
                )

            return states_to_token_maps, empty_token_ids

        (
            from_interegular_instance.states_to_token_maps,
            from_interegular_instance.empty_token_ids,
        ) = create_states_mapping_from_interegular_fsm(
            interegular_fsm, tuple(sorted(tokenizer.vocabulary.items()))
        )
        from_interegular_instance.vocabulary = tokenizer.vocabulary.values()
        from_interegular_instance.eos_token_id = tokenizer.eos_token_id
        return from_interegular_instance
```
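For readers unfamiliar with the state-to-token mapping built in the snippet above, here is a simplified, self-contained sketch of the general idea: walk each token's characters through a character-level DFA and record which state it lands in. `build_token_index` and `check_language_reachable` are hypothetical names, and the details (no empty-token tracking, no caching) are assumptions made to keep the example short; this is not Outlines' implementation.

```python
from typing import Dict, Set, Tuple

Transitions = Dict[Tuple[int, str], int]


def build_token_index(
    transitions: Transitions, states: Set[int], vocabulary: Dict[str, int]
) -> Dict[int, Dict[int, int]]:
    """Map each state to {token_id: state reached after consuming the whole token}."""
    index: Dict[int, Dict[int, int]] = {}
    for state in states:
        for token, token_id in vocabulary.items():
            if not token:
                continue  # empty tokens are ignored in this sketch
            current = state
            for char in token:
                current = transitions.get((current, char))
                if current is None:
                    break  # the DFA has no path for this token from `state`
            else:
                index.setdefault(state, {})[token_id] = current
    return index


def check_language_reachable(index: Dict[int, Dict[int, int]], finals: Set[int]) -> None:
    """Raise if no token transition ever lands in a final state."""
    if not any(finals.intersection(v.values()) for v in index.values()):
        raise ValueError("The vocabulary cannot produce a string accepted by the FSM")


# Character-level DFA for (ab)+ : 0 -a-> 1 -b-> 2 -a-> 1, with final state 2.
transitions = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
states, finals = {0, 1, 2}, {2}
vocabulary = {"a": 0, "b": 1, "ab": 2, "ba": 3, "c": 4}

index = build_token_index(transitions, states, vocabulary)
check_language_reachable(index, finals)
```

At generation time, the tokens allowed in a given state are the keys of `index[state]`, and the chosen token's value gives the next state.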

Feb 20 '24 12:02 miftahmoha

Nice progress @miftahmoha :)

Does an FSM object have a consistent outlines.caching.hash_arguments() value between runs? I vaguely recall having issues with this.

Could you create a draft PR?

Feb 20 '24 12:02 lapp0

That seems reasonable :)

Feb 20 '24 12:02 rlouf

@lapp0 Done.

It seems the hash is indeed not consistent: a different run with the same input yields a different hash for the interegular.fsm.FSM object.
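One way to get a cache key that is stable across runs is to hash a canonical, sorted serialization of the automaton instead of the object itself. The sketch below is an assumption about how this could be approached, not what Outlines or interegular do; `stable_fsm_key` is a hypothetical helper operating on plain dicts and sets, and the thread does not confirm the underlying cause of the unstable hash.

```python
import hashlib
import json
from typing import Dict, Set, Tuple


def stable_fsm_key(
    initial: int, finals: Set[int], transitions: Dict[Tuple[int, str], int]
) -> str:
    """Hash a sorted, canonical description so the key ignores container ordering."""
    canonical = {
        "initial": initial,
        "finals": sorted(finals),
        # Each transition becomes [state, char, next_state]; sorting removes
        # any dependence on dict insertion order or set iteration order.
        "transitions": sorted([s, c, t] for (s, c), t in transitions.items()),
    }
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same automaton described with different insertion orders.
key_a = stable_fsm_key(0, {2}, {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1})
key_b = stable_fsm_key(0, {2}, {(2, "a"): 1, (1, "b"): 2, (0, "a"): 1})
key_c = stable_fsm_key(0, {1}, {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1})
```

Because the digest depends only on sorted content, `key_a` and `key_b` match, while changing the final states (as in `key_c`) produces a different key.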

Feb 21 '24 23:02 miftahmoha

Should be closed thanks to @miftahmoha's implementation and great documentation!

https://github.com/outlines-dev/outlines/pull/699

May 09 '24 08:05 lapp0