syncode
Efficient and general syntactical decoding for Large Language Models
To make SynCode work on serving LLMs:
If we are parsing multiple partial codes, the parsers currently run sequentially in a single thread; running them in parallel across multiple threads should be much faster. [Example](https://github.com/microsoft/monitors4codegen/blame/022c65efb19cf6046d0b67960ca232ae7a351af4/src/monitors4codegen/monitor_guided_decoding/hf_gen.py#L62) [Options](https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio)
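A minimal sketch of the thread-pool approach, using `concurrent.futures` from the standard library. The `parse_partial_code` helper is a hypothetical stand-in for SynCode's actual incremental parser call; it is not the real API.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_partial_code(parser, code):
    # Hypothetical stand-in: invoke one parser on one partial program.
    # In SynCode this would be the incremental-parse call per candidate.
    return parser(code)

def parse_all(parsers, partial_codes):
    # Run one parser per partial code concurrently instead of
    # looping over candidates sequentially in a single thread.
    with ThreadPoolExecutor(max_workers=len(parsers)) as pool:
        return list(pool.map(parse_partial_code, parsers, partial_codes))
```

Note that pure-Python, CPU-bound parsing will not speed up much under the GIL; if the parser does not release the GIL, multiprocessing (as weighed in the linked Options thread) may be the better fit.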
Once LR(1) parsing is fast enough, we should add support for OpenAI models as in [monitors4codegen](https://github.com/microsoft/monitors4codegen).
First of all, thank you for this great work! I would like to know whether SynCode can also work with fill-in-the-middle models, and if so, how?
Currently, when trying to run constrained decoding with a new grammar, we are prompted with: `Creating DFA mask store for LlamaTokenizerFast and custom, may take more than 10 minutes....`
In the following example, the constrained model fails to generate a string that matches the specified grammar:

```python
import traceback
import lark
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from...
```