Add `outlines.generate.fsm()` API entrypoint
As suggested in https://github.com/outlines-dev/outlines/issues/666#issuecomment-1947685486, we should allow users to pass any DFA (with vocabulary = characters) to generate text that is in the language accepted by that DFA.
We should allow passing an interegular.FSM because the operations requested in the linked issue, and those requested by users in other issues, are not possible with a RegexFSM. E.g., you cannot do subtraction with RegexFSM to disallow keywords, but you can with interegular.FSM.
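To make the "subtraction to disallow keywords" idea concrete, here is a minimal pure-Python sketch of DFA difference via the product construction. It does not use interegular; the `DFA` class, the `literal` helper, and the three-letter alphabet are all hypothetical and exist only for illustration.

```python
from itertools import product


class DFA:
    """A toy complete DFA over single characters (hypothetical, for illustration)."""

    def __init__(self, alphabet, states, initial, finals, delta):
        self.alphabet = frozenset(alphabet)
        self.states = frozenset(states)
        self.initial = initial
        self.finals = frozenset(finals)
        self.delta = dict(delta)  # (state, char) -> state, defined for every pair

    def accepts(self, text):
        state = self.initial
        for ch in text:
            if ch not in self.alphabet:
                return False
            state = self.delta[(state, ch)]
        return state in self.finals

    def __sub__(self, other):
        """Product construction: accept strings in self but not in other."""
        assert self.alphabet == other.alphabet
        states = set(product(self.states, other.states))
        delta = {
            ((a, b), ch): (self.delta[(a, ch)], other.delta[(b, ch)])
            for (a, b) in states
            for ch in self.alphabet
        }
        finals = {
            (a, b) for (a, b) in states
            if a in self.finals and b not in other.finals
        }
        return DFA(self.alphabet, states, (self.initial, other.initial), finals, delta)


def literal(word, alphabet):
    """Build a complete DFA that accepts exactly `word`."""
    dead = len(word) + 1  # sink state for every mismatch
    delta = {}
    for state in range(dead + 1):
        for ch in alphabet:
            if state < len(word) and word[state] == ch:
                delta[(state, ch)] = state + 1
            else:
                delta[(state, ch)] = dead
    return DFA(alphabet, range(dead + 1), 0, {len(word)}, delta)


ALPHABET = "ifx"
# One or more letters from the alphabet: a stand-in for "identifier".
identifier = DFA(ALPHABET, {0, 1}, 0, {1}, {(s, ch): 1 for s in (0, 1) for ch in ALPHABET})
# Identifiers that are not the keyword "if".
non_keyword = identifier - literal("if", ALPHABET)
```

With this toy setup, `identifier.accepts("if")` holds but `non_keyword.accepts("if")` does not, while longer strings such as `"iff"` are still accepted by both.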
We should modify RegexFSM, to have RegexFSM.from_interegular_fsm
https://github.com/outlines-dev/outlines/blob/main/outlines/fsm/fsm.py#L94-L125
> We should modify RegexFSM, to have RegexFSM.from_interegular_fsm
This is a hint that we have a naming problem here. And we might as well fix it before more libraries depend on Outlines' internals.
Agreed 100%. Maybe TokenIndexFSM?
Can someone please explain this to me? I'm confused.
Does the user have to create the FSM with interegular and pass it to a method outlines.generate.fsm(fsm=interegular_fsm) which returns the SequenceGenerator?
I guess RegexFSM.from_interegular_fsm should turn an interegular_fsm object into a RegexFSM object so that it can be processed by SequenceGenerator?
I'm interested in working on it if it's possible.
> Does the user have to create the FSM with interegular and pass it to a method outlines.generate.fsm(fsm=interegular_fsm) which returns the SequenceGenerator?
Yes. interegular allows you to perform helpful operations on FSMs - concatenation, OR / XOR, intersection, negation, etc. RegexFSM doesn't have these capabilities.
> I guess RegexFSM.from_interegular_fsm should turn an interegular_fsm object into a RegexFSM object so that it can be processed by SequenceGenerator?
Yes exactly!
> I'm interested in working on it if it's possible.
That's great news!
The current RegexFSM.__init__ logic involves conversion from pattern -> interegular.FSM -> initialization of Outlines RegexFSM. Ideally from_interegular_fsm wouldn't repeat logic with __init__, but instead they'd share as much logic as possible using a helper function.
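A rough sketch of that layout, where `__init__` and an alternate constructor share one helper. Everything here is hypothetical (`GuideSketch`, `compile_literal`, `_init_from_fsm`, and the dict-based toy FSM are stand-ins, not Outlines code); it only shows how the two entry points could avoid duplicating logic.

```python
from typing import Dict, Tuple

# A toy stand-in for a compiled character-level automaton: (state, char) -> state.
ToyFSM = Dict[Tuple[int, str], int]


def compile_literal(pattern: str) -> ToyFSM:
    """Hypothetical 'pattern -> FSM' step; here a pattern is just a literal string."""
    return {(i, ch): i + 1 for i, ch in enumerate(pattern)}


class GuideSketch:
    """Hypothetical stand-in for RegexFSM, showing only the constructor layout."""

    def __init__(self, pattern: str):
        # Only the pattern -> FSM conversion is specific to this entry point.
        self._init_from_fsm(compile_literal(pattern))

    @classmethod
    def from_fsm(cls, fsm: ToyFSM) -> "GuideSketch":
        # Bypass __init__ so no pattern is required, then reuse the shared step.
        instance = cls.__new__(cls)
        instance._init_from_fsm(fsm)
        return instance

    def _init_from_fsm(self, fsm: ToyFSM) -> None:
        # Shared helper: everything that depends only on the FSM lives here.
        self.transitions = dict(fsm)
        self.final_state = max(fsm.values(), default=0)


from_pattern = GuideSketch("ab")
from_existing = GuideSketch.from_fsm(compile_literal("ab"))
```

Both objects end up with the same attributes, so downstream code does not need to know which constructor was used.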
I'd work off of https://github.com/outlines-dev/outlines/blob/main/outlines/generate/regex.py#L10-L36 for the second step, creating outlines.generate.fsm()
Please ping me if you have any questions.
What do you guys think? @lapp0 @rlouf
outlines/outlines/generate/__init__.py
...
from .fsm import fsm
outlines/outlines/generate/fsm.py
def fsm(model, fsm: FSM, sampler: Sampler = multinomial()) -> SequenceGenerator:
    fsm = RegexFSM.from_interegular_fsm(fsm, model.tokenizer)
    device = model.device
    generator = SequenceGenerator(fsm, model, sampler, device)
    return generator
outlines/outlines/fsm/fsm.py
class RegexFSM:
    ...

    @classmethod
    def from_interegular_fsm(cls, interegular_fsm: FSM, tokenizer: "Tokenizer"):
        from_interegular_instance = cls.__new__(cls)

        @cache()
        def create_states_mapping_from_interegular_fsm(
            fsm: FSM, cacheable_vocabulary: Tuple[Tuple[str, int]]
        ) -> Tuple[dict, set]:
            """Create the variables related to the mapping between states and tokens
            The parameters of the function are used for caching purpose
            """
            regex_fsm, _ = make_deterministic_fsm(fsm.reduce())
            states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
                regex_fsm, tokenizer
            )

            # We make sure that it is possible to generate strings in the language
            # of the regular expression with the tokens present in the model's
            # vocabulary.
            if not any(
                regex_fsm.finals.intersection(v.values())
                for v in states_to_token_maps.values()
            ):
                raise ValueError(
                    "The vocabulary does not allow us to build a sequence that matches the input regex"
                )

            return states_to_token_maps, empty_token_ids

        (
            from_interegular_instance.states_to_token_maps,
            from_interegular_instance.empty_token_ids,
        ) = create_states_mapping_from_interegular_fsm(
            interegular_fsm, tuple(sorted(tokenizer.vocabulary.items()))
        )
        from_interegular_instance.vocabulary = tokenizer.vocabulary.values()
        from_interegular_instance.eos_token_id = tokenizer.eos_token_id
        return from_interegular_instance
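For readers unfamiliar with what a states-to-tokens mapping contains, here is a simplified pure-Python sketch of the idea: walk each vocabulary token through a character-level DFA from every state, and record where it lands. `build_token_index` and `can_reach_final` are hypothetical names and a simplification, not the actual `create_fsm_index_tokenizer`; the second function mirrors the reachability check in the snippet above.

```python
from typing import Dict, Set, Tuple


def build_token_index(
    delta: Dict[Tuple[int, str], int],
    states: Set[int],
    vocabulary: Dict[str, int],
) -> Dict[int, Dict[int, int]]:
    """Map each state to {token_id: state reached after consuming the whole token}."""
    index: Dict[int, Dict[int, int]] = {}
    for start in states:
        for token, token_id in vocabulary.items():
            state = start
            for ch in token:
                state = delta.get((state, ch))
                if state is None:  # no transition: the token is not allowed here
                    break
            else:
                if token:  # ignore empty tokens
                    index.setdefault(start, {})[token_id] = state
    return index


def can_reach_final(index: Dict[int, Dict[int, int]], finals: Set[int]) -> bool:
    """True if at least one token transition lands on an accepting state."""
    return any(finals.intersection(targets.values()) for targets in index.values())


# Character-level DFA accepting exactly "ab" (missing transitions mean rejection).
delta = {(0, "a"): 1, (1, "b"): 2}
states = {0, 1, 2}
finals = {2}

vocabulary = {"a": 0, "b": 1, "ab": 2, "c": 3}
index = build_token_index(delta, states, vocabulary)
usable = can_reach_final(index, finals)

# A vocabulary with no matching tokens cannot produce any accepted string.
unusable = can_reach_final(build_token_index(delta, states, {"c": 3}), finals)
```

In this toy case the token "ab" takes state 0 straight to the accepting state, which is why a token-level index is needed on top of the character-level DFA.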
Nice progress @miftahmoha :)
Does an FSM object have a consistent outlines.caching.hash_arguments() value between runs? I vaguely recall having issues with this.
Could you create a draft PR?
That seems reasonable :)
@lapp0 Done.
It seems that this is indeed an issue: a different run with the same input produces a different hash for the interegular.fsm.FSM object.
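The thread does not say why the hash differs, so the following is only a generic sketch of one way to get a run-independent cache key: derive it from a canonical, sorted view of the object's contents rather than from the object itself. `ToyFSM` and `content_key` are hypothetical and are not what Outlines or the linked PR does.

```python
import hashlib


class ToyFSM:
    """Hypothetical stand-in for an FSM object; it defines no custom __hash__ or __repr__."""

    def __init__(self, initial, finals, transitions):
        self.initial = initial
        self.finals = set(finals)
        self.transitions = dict(transitions)  # (state, symbol) -> state


def content_key(fsm: ToyFSM) -> str:
    """Build a digest from a canonical, sorted view of the FSM's contents."""
    canonical = repr(
        (
            fsm.initial,
            sorted(fsm.finals),
            sorted(fsm.transitions.items()),
        )
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = ToyFSM(0, {2}, {(0, "a"): 1, (1, "b"): 2})
# Same contents, built in a different order, as a separate object.
second = ToyFSM(0, {2}, {(1, "b"): 2, (0, "a"): 1})

same_key = content_key(first) == content_key(second)
```

Sorting before hashing matters because set and dict iteration order is not something a cache key should depend on, and the default `repr` of a plain object includes its memory address.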
Should be closed thanks to @miftahmoha's implementation and great documentation!
https://github.com/outlines-dev/outlines/pull/699