dspy
Reproducible results with HFModel or MLC
I noticed that the results are not reproducible. I am using Llama with BootstrapFewShot, and every time I compile the same program I get totally different results (not even close). In the Predict forward function I see that the temperature is changed, so is there a way to set the temperature to 0 in order to get reproducible results?
Are you using Llama through the TGI client or through HFModel?
If you use the TGI client, all results are reproducible. HFModel is currently supported only on a "best effort" basis. No promises that it's great yet.
We have documentation for TGI here: https://github.com/stanfordnlp/dspy/blob/main/docs/language_models_client.md#tgi
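For context, pointing dspy at a running TGI server looks roughly like the sketch below; the model name, port, and URL are placeholders for your own deployment, so follow the linked docs for the exact setup.

import dspy

# Placeholder values: adjust to match your own TGI deployment.
llama = dspy.HFClientTGI(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
dspy.settings.configure(lm=llama)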
Well, thank you Omar. I am actually using the MLC client, following this tutorial: docs/using_local_models.md
Makes sense. HFModel and especially MLC are currently great for demos / exploration, but they're not going to be reliable for systematic tests. MLC in particular is quantized, so a lot of noise could exist across runs.
TGI client (for open models) and OpenAI are resilient. (You can try to do compiled_program.save('path') and later load it. That's one option for reproducibility.)
Otherwise I've assigned this issue to myself to make these reproducible through caching. This is expected to take 1–2 weeks.
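As a concrete example of the save/load route above, a round trip looks roughly like this (MyProgram stands in for your own dspy.Module subclass):

compiled_program.save('compiled_program.json')

# Later, in a fresh process: rebuild the (uncompiled) module and load the saved state.
program = MyProgram()
program.load('compiled_program.json')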
I will give the TGI client a try. Thank you @okhat.
I was able to get reproducible results by setting the temperature of Llama to zero.
In hf_client.py, I passed **kwargs through to ChatConfig:
class ChatModuleClient(HFModel):
    def __init__(self, model, model_path, **kwargs):
        super().__init__(model=model, is_client=True)

        from mlc_chat import ChatModule
        from mlc_chat import ChatConfig

        self.cm = ChatModule(model=model, lib_path=model_path, chat_config=ChatConfig(**kwargs))
kwargs = {
    "temperature": 0.0,
    "conv_template": "LM",
}

llama = dspy.ChatModuleClient(model='mlc-chat-Llama-2-7b-chat-hf-q4f16_1',
                              model_path='dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so',
                              **kwargs)
dspy.settings.configure(lm=llama)
This can also be done by editing mlc-chat-config.json directly.
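If you go the config-file route, a minimal sketch of that edit would be something like the following (the path is a placeholder for wherever your prebuilt model lives, and it assumes the config exposes a top-level temperature field):

import json

# Placeholder path: adjust to your prebuilt MLC model directory.
config_path = 'dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1/mlc-chat-config.json'

with open(config_path) as f:
    config = json.load(f)

config['temperature'] = 0.0  # greedy decoding for run-to-run reproducibility

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)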
Thanks @Sohaila-se! Just FYI, MLC may be missing some features and is heavily quantized, so use it with some caution.
@okhat, quick question. I've started running dspy on my larger tasks. When I save the TGI pipeline (or even an OpenAI one), the lm field is set to null. Is this due to improper config on my part?
lm = dspy.OpenAI(model=model_name, api_key=os.getenv('OPENAI_API_KEY'))
dspy.settings.configure(lm=lm)

if RECOMPILE_INTO_LLAMA_FROM_SCRATCH:
    tp = BootstrapFewShot(metric=metric_EM, max_bootstrapped_demos=2, max_rounds=1, max_labeled_demos=2)
    compiled_boostrap = tp.compile(DetectConcern(), trainset=train_examples[:10], valset=train_examples[-10:])
    print("woof")

    try:
        compiled_boostrap.save(os.path.join(ROOT_DIR, "dat", "college", f"{model_name}_concerndetect.json"))
This outputs a JSON file for the pipeline with:
{
    "most_likely_concerns": {
        "lm": null,
        "traces": [],
        "train": [],
        "demos": "omitted"
    },
    "concern_present": {
        "lm": null,
        "traces": [],
        "train": [],
        "demos": "omitted"
    },
I get identical output (in terms of the lm field, traces, and train) when saving a Llama-13b pipeline. Am I doing something obviously wrong, or is saving still inconsistent? My pipeline (I don't think it matters) is below for reference:
class DetectConcern(dspy.Module):
    def __init__(self):
        super().__init__()
        self.most_likely_concerns = dspy.Predict(MostLikelyConcernsSignature, max_tokens=100)
        self.concern_present = dspy.ChainOfThought(ConcernPresentSignature)

    def forward(self, title, post, possible_concerns):
        # Get the most likely concerns
        most_likely_concerns = self.most_likely_concerns(
            title=title, post=post, possible_concerns=possible_concerns
        ).most_likely_concerns

        # Process the concerns
        cleaned_concerns = clean_concerns_to_list(most_likely_concerns)
        detected_concerns = []

        # For the first six concerns, check whether each is present in the post
        for clean_concern in cleaned_concerns[:6]:
            concern_present = self.concern_present(
                title=title, post=post, concern=clean_concern
            )
            is_concern_present = concern_present.concern_present
            reasoning = concern_present.reasoning

            if true_or_false(is_concern_present):
                detected_concerns.append(clean_concern)

        detected_concerns = ', '.join(detected_concerns)
        return detected_concerns
At the moment, would a good workaround be pickling?
@darinkishore lm being null is not a bug here. None just means "use the currently configured LM."
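So, roughly, you can reload the saved program and just configure whichever LM you want at that point; a sketch reusing the names from your snippet above:

lm = dspy.OpenAI(model=model_name, api_key=os.getenv('OPENAI_API_KEY'))
dspy.settings.configure(lm=lm)

# Rebuild the module and load the compiled state; the "lm": null entries simply
# mean each predictor falls back to dspy.settings.lm at call time.
program = DetectConcern()
program.load(os.path.join(ROOT_DIR, "dat", "college", f"{model_name}_concerndetect.json"))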