dspy
Reproducible results with HFModel or MLC
I noticed that the results are not reproducible. I am using Llama with BootstrapFewShot, and every time I compile the same program I get totally different results (not even close). In the Predict forward function I see that the temperature is changed, so is there a way to set the temperature to 0 in order to get reproducible results?
Are you using Llama through the TGI client or through HFModel?
If you use the TGI client, all results are reproducible. HFModel is currently supported only on a "best effort" basis. No promises that it's great yet.
We have documentation for TGI here: https://github.com/stanfordnlp/dspy/blob/main/docs/language_models_client.md#tgi
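For context, pointing dspy at a running TGI server looks roughly like the sketch below; the model name, port, and URL are placeholders for your own deployment, so follow the linked docs for the exact setup.

import dspy

# Placeholder values: adjust to match your own TGI deployment.
llama = dspy.HFClientTGI(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
dspy.settings.configure(lm=llama)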
Well, thank you Omar. I am actually using the MLC client, following this tutorial: docs/using_local_models.md
Makes sense. HFModel and especially MLC are currently great for demos / exploration, but they're not going to be reliable for systematic tests. MLC in particular is quantized, so a lot of noise could exist across runs.
TGI client (for open models) and OpenAI are resilient. (You can try to do compiled_program.save('path') and later load it. That's one option for reproducibility.)
Otherwise I've assigned this issue to myself to make these reproducible through caching. This is expected to take 1–2 weeks.
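As a concrete example of the save/load route above, a round trip looks roughly like this (MyProgram stands in for your own dspy.Module subclass):

compiled_program.save('compiled_program.json')

# Later, in a fresh process: rebuild the (uncompiled) module and load the saved state.
program = MyProgram()
program.load('compiled_program.json')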
I will give the TGI client a try. Thank you @okhat.
I was able to get reproducible results by setting the temperature of Llama to zero.
In hf_client.py, I passed **kwargs through to ChatConfig:
class ChatModuleClient(HFModel):
    def __init__(self, model, model_path, **kwargs):
        super().__init__(model=model, is_client=True)

        from mlc_chat import ChatModule
        from mlc_chat import ChatConfig

        self.cm = ChatModule(model=model, lib_path=model_path, chat_config=ChatConfig(**kwargs))
kwargs = {
    "temperature": 0.0,
    "conv_template": "LM",
}

llama = dspy.ChatModuleClient(model='mlc-chat-Llama-2-7b-chat-hf-q4f16_1',
                              model_path='dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so',
                              **kwargs)
dspy.settings.configure(lm=llama)
This can also be done by editing mlc-chat-config.json directly.
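If you go the config-file route, a minimal sketch of that edit would be something like the following (the path is a placeholder for wherever your prebuilt model lives, and it assumes the config exposes a top-level temperature field):

import json

# Placeholder path: adjust to your prebuilt MLC model directory.
config_path = 'dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1/mlc-chat-config.json'

with open(config_path) as f:
    config = json.load(f)

config['temperature'] = 0.0  # greedy decoding for run-to-run reproducibility

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)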
Thanks @Sohaila-se! Just FYI, MLC may be missing some features and is heavily quantized, so use it with some caution.
@okhat, quick question. I've started running dspy on my larger tasks. When I save the TGI pipeline (or even an OpenAI one), the lm field is set to null. Is this due to improper config on my part?
lm = dspy.OpenAI(model=model_name, api_key=os.getenv('OPENAI_API_KEY'))
dspy.settings.configure(lm=lm)

if RECOMPILE_INTO_LLAMA_FROM_SCRATCH:
    tp = BootstrapFewShot(metric=metric_EM, max_bootstrapped_demos=2, max_rounds=1, max_labeled_demos=2)
    compiled_boostrap = tp.compile(DetectConcern(), trainset=train_examples[:10], valset=train_examples[-10:])
    print("woof")

    try:
        compiled_boostrap.save(os.path.join(ROOT_DIR, "dat", "college", f"{model_name}_concerndetect.json"))
This outputs a JSON file for the pipeline with:
{
    "most_likely_concerns": {
        "lm": null,
        "traces": [],
        "train": [],
        "demos": "omitted"
    },
    "concern_present": {
        "lm": null,
        "traces": [],
        "train": [],
        "demos": "omitted"
    },
I get identical output (in terms of the lm field, traces, and train) when saving a Llama-13b pipeline. Am I doing something obviously wrong, or is saving still inconsistent? My pipeline (I don't think it matters) is below for reference:
class DetectConcern(dspy.Module):
    def __init__(self):
        super().__init__()
        self.most_likely_concerns = dspy.Predict(MostLikelyConcernsSignature, max_tokens=100)
        self.concern_present = dspy.ChainOfThought(ConcernPresentSignature)

    def forward(self, title, post, possible_concerns):
        # Get the most likely concerns
        most_likely_concerns = self.most_likely_concerns(
            title=title, post=post, possible_concerns=possible_concerns
        ).most_likely_concerns

        # Process the concerns
        cleaned_concerns = clean_concerns_to_list(most_likely_concerns)
        detected_concerns = []

        # For the first six concerns, check whether each is present in the post
        for clean_concern in cleaned_concerns[:6]:
            concern_present = self.concern_present(
                title=title, post=post, concern=clean_concern
            )
            is_concern_present = concern_present.concern_present
            reasoning = concern_present.reasoning

            if true_or_false(is_concern_present):
                detected_concerns.append(clean_concern)

        detected_concerns = ', '.join(detected_concerns)
        return detected_concerns
At the moment, would a good workaround be pickling?
@darinkishore lm being null is not a bug here. None just means "use the currently configured LM."
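So, roughly, you can reload the saved program and just configure whichever LM you want at that point; a sketch reusing the names from your snippet above:

lm = dspy.OpenAI(model=model_name, api_key=os.getenv('OPENAI_API_KEY'))
dspy.settings.configure(lm=lm)

# Rebuild the module and load the compiled state; the "lm": null entries simply
# mean each predictor falls back to dspy.settings.lm at call time.
program = DetectConcern()
program.load(os.path.join(ROOT_DIR, "dat", "college", f"{model_name}_concerndetect.json"))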