Steven Krawczyk (Hegel AI)
Hey @rachittshah, we'd love your help! I think you're right that we'll need to build prompttools abstractions on top of the llamaindex abstractions. I'm hoping there are a few abstractions...
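To make that concrete, here is a rough sketch of the kind of parameter sweep a prompttools experiment could wrap around a llama_index query engine. The swept parameter (`similarity_top_k`), the queries, and the result schema are illustrative assumptions, not a final design.

```python
# Rough sketch of a prompttools-style sweep over a llama_index query engine.
from itertools import product

from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

queries = ["What does the design doc say about retries?"]  # example inputs
top_k_values = [1, 3, 5]  # parameter we sweep, like a prompttools experiment argument

results = []
for query, top_k in product(queries, top_k_values):
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(query)
    results.append({"query": query, "similarity_top_k": top_k, "response": str(response)})
```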
Thanks for bringing this issue up; I'm looking into it.
Which model in particular are you trying to use? The Hugging Face Hub experiment uses the Inference API, so it will only support models that the API supports. Is there a...
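In the meantime, a quick way to check whether a model is actually served by the hosted Inference API is to hit its public endpoint directly. The model id and token env var below are examples only.

```python
# Quick check that the hosted Inference API can serve a given model; the Hugging Face
# Hub experiment calls this same API under the hood.
import os
import requests

model_id = "google/flan-t5-small"  # swap in the model you're trying to run
resp = requests.post(
    f"https://api-inference.huggingface.co/models/{model_id}",
    headers={"Authorization": f"Bearer {os.environ['HUGGINGFACEHUB_API_TOKEN']}"},  # example env var name
    json={"inputs": "Hello, world"},
    timeout=30,
)
# A 200 with generated text means the API supports the model; a 4xx error usually means it doesn't.
print(resp.status_code, resp.json())
```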
Have you downloaded the model `llama-2-7b-chat.ggmlv3.q2_K.bin` and followed the setup instructions at https://github.com/ggerganov/llama.cpp and https://github.com/abetlen/llama-cpp-python?
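If the download and install went through, a minimal load-and-generate check with llama-cpp-python looks roughly like this (the model path is wherever you saved the file):

```python
# Minimal smoke test that llama-cpp-python can load the downloaded GGML weights.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q2_K.bin")  # adjust to your local path
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])
```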
Thanks for raising this @RigvedRocks! I'll review it now. When you get a chance, could you fill out the CLA?
I'm not very familiar with promptbench, but it looks like you want to run attacks as experiments and use your eval function to evaluate the experiment outputs. You can check...
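Roughly, the pattern I have in mind is: each attack variant becomes one run, and your eval function scores the collected outputs afterwards. The `run_attack` helper and the row schema below are placeholders, not promptbench or prompttools APIs.

```python
# Schematic: one row per attack variant, scored by a user-supplied eval function.
def run_attack(prompt: str) -> str:
    """Placeholder for the promptbench attack + model call."""
    return "model response under attack"

def my_eval_fn(prompt: str, response: str) -> float:
    # e.g. 1.0 if the model still answers correctly under the perturbed prompt
    return float("correct answer" in response.lower())

attack_prompts = ["original prompt", "perturbed prompt #1", "perturbed prompt #2"]
rows = []
for prompt in attack_prompts:
    response = run_attack(prompt)
    rows.append({"prompt": prompt, "response": response, "score": my_eval_fn(prompt, response)})
```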
Hey @Divij97, this looks great! Very elegant way to support Anthropic + OpenAI as evaluators. I'm guessing Claude and GPT will need different eval prompts, but this is definitely headed...
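One way that could look (wording and model names are purely illustrative): keep one eval prompt per evaluator family, since the Claude completions API expects Human:/Assistant: formatting while GPT chat models take a message list.

```python
# Sketch of per-evaluator eval prompts; templates are illustrative only.
EVAL_PROMPTS = {
    "gpt-4": [
        {"role": "system", "content": "You are grading a model response. Reply with PASS or FAIL."},
        {"role": "user", "content": "Prompt: {prompt}\nResponse: {response}"},
    ],
    "claude-2": (
        "\n\nHuman: Grade the following model response. Reply with PASS or FAIL.\n"
        "Prompt: {prompt}\nResponse: {response}\n\nAssistant:"
    ),
}
```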
@LuvvAggarwal Sure thing. The scope of this one is a bit large because we currently don't have any common benchmarks. I think a simple case would be the following *...
@LuvvAggarwal Using `datasets` sounds like a good start. As for `evaluate`, we want to write our own eval methods that support more than just Hugging Face models (e.g. OpenAI, Anthropic).
For example, if you are using the hellaswag dataset, we'd need to compute the accuracy of the predictions, e.g. https://github.com/openai/evals/blob/main/evals/metrics.py#L12
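Roughly something like this, where the field names are assumed from the hellaswag dataset card and `predict` is a placeholder for whichever provider is being benchmarked:

```python
# Rough sketch: load hellaswag via `datasets` and compute plain accuracy,
# mirroring the accuracy helper linked above.
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")

def predict(example) -> int:
    """Placeholder for the model call; returns the index of the chosen ending."""
    return 0

correct = sum(predict(ex) == int(ex["label"]) for ex in ds)
print(f"accuracy: {correct / len(ds):.3f}")
```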