evals
evals copied to clipboard
eval:Chinese Semantic Understanding - Film Review Analysis
Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.
Eval details 📑
Eval name
Chinese Semantic Understanding - Film Review Analysis
Eval description
Based on the online reviews of movie fans, we can evaluate GPT's understanding ability of Chinese semantics. Currently, the accuracy of GPT3.5 is only 0.44.
View eval-log JSON
Eval
{"spec": {"model_name": "gpt-3.5-turbo", "model_names": {"completions": ["gpt-3.5-turbo"]}, "eval_name": "film_review.dev.v0", "base_eval": "film_review", "split": "dev", "run_config": {"model_specs": {"completions_": [{"name": "gpt-3.5-turbo", "model": "gpt-3.5-turbo", "engine": null, "api_base": null, "is_chat": true, "encoding": null, "organization": null, "api_key": null, "extra_options": {}, "metadata": null, "headers": {}, "strip_completion": true, "n_ctx": 4096, "format": null, "key": null, "group": null}], "embedding_": null, "ranking_": null}, "eval_spec": {"cls": "evals.elsuite.basic.match:Match", "args": {"samples_jsonl": "film_review/samples.jsonl"}, "key": "film_review.dev.v0", "group": "film_review"}, "seed": 20220722, "max_samples": null, "command": "/home/ubuntu/.local/bin/oaieval gpt-3.5-turbo film_review", "initial_settings": {"visible": true}}, "created_by": "", "run_id": "23041309131233NENEMF", "created_at": "2023-04-13 09:13:12.086292"}}
{"final_report": {"accuracy": 0.44}}
What makes this a useful eval?
Many times, I need to judge whether a movie is worth watching through other people's reviews, which requires truly understanding the true emotions that the reviewers want to express. Currently, GPT3.5's semantic understanding ability is limited, especially for Chinese.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for
Basicevals or theFactModel-graded eval, or an exhaustive rubric for evaluating answers for theCriteriaModel-graded eval. - [x] Include at least 15 high quality examples.
If there is anything else that makes your eval worth including, please document it below.
Unique eval value
AI can only give sufficiently accurate responses if it truly understands the intent behind human language, which can make AI increasingly intelligent and more empathetic in communicating with humans.
Eval structure 🏗️
Your eval should
- [x] Check that your data is in
evals/registry/data/{name} - [x] Check that your yaml is registered at
evals/registry/evals/{name}.yaml - [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run
pip install pre-commit; pre-commit installand have verified thatblack,isort, andautoflakeare running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 我还蛮喜欢的 虽然主旨很老套 就是要珍惜时间要生活不要生存之类的 但是设定很新颖 从40岁生日过完睁开眼开始 就跟男主角一样一脸懵的看他一天过一年 看的时候一直有种很遗憾和心痛的感觉 错过了妻子怀孕孩子出生、朋友生病父亲去世 感觉就在工作中麻木过去了 就这样9年没了 我一直以为结尾会是一个梦 醒过来男主角受启发辞职之类的 那些美好的瞬间再也没错过 不过剧情显然比我残酷多了 过去就是过去了 剧中有句话 人有两天无法做任何事 昨天和明天 所以就好好过好今天"}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 2023年4月5日观看。挺好看的,片尾曲尤其好听。"}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 创意真不错的。现实写照。"}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 我有一个很好的消息;但是无法与你分享,我就不觉得这消息有多棒了。我的第一反应就是给你打电话,但然后..我不知道,是否应该这么做...一年中有两天,你是做不了任何事情的,昨天和明天...在生活中有很多东西会过去会改变会离开,一切东西迟早都会过去会溜走或是改变,只有一样东西不会离开,也永远不会离开..."}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 竟然是英语的,要找意大利语版本再看一遍。"}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 慢下来,做好现在的事,珍惜眼前的人。成为此刻的自己。"}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 好厉害的电影"}], "ideal": "A"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 对我而言,这是一部非常恐怖的电影,因为这部电影的故事概念和我几年前想的概念几乎是一模一样,剧本也几易其稿,但是还不到我满意的版本。而在此之前,是完全不知道的,这完全就是巧合般的创意相撞。可能我们的创意灵感都是来源于《土拨鼠之日》,这部电影它会更偏向欧美的小清新风格,它里面的情感张力,并没有推的太高,更多是由于生活层面的表达,可能对于观众人的审美口味来说,会显得平淡而乏味的一点,这样的一种穿越的方式只能是一个有趣的想法,充不起一整部电影的趣味性和可观性。"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 失去爱情,失去亲情,失去朋友,迷失自我。好在男主的事业越来越好,现实中可能连工作都一团糟。前半段是真的好笑,演员的微表情都控制的很好。几个人生节点的配乐都挺好听的。最后结局难免有些落入俗套,稍微有点拖。"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 这种戏不适合在大的电影院里看,在NF上就足够了。演员表演还不错,但是剧本最后的结局让人有点堵,如果你按照超现实的手法讲了100分钟故事,那么我还是希望能让这种超现实的剧本最终回归正常毕竟消失的那些年究竟是错付了,让人遗憾。"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 看到一半也没质的进展,最后总算是交代了"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 一年中有两天人无法做任何事,昨天和明天。电影说是达赖说的。。。创意也是围绕着这个。男主只有今天,没有明天和昨天。昨天发生的事几乎都会忘记。而明天转瞬即来。。。其实是有创意的故事。但是人物穿越的逻辑没理顺,不好看。几乎相当于MTV"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 结尾可以更好。Alice的男友呢..?"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 有点人生思考,也不单调。"}], "ideal": "B"}
{"input": [{"role": "system", "content": "你是一个影评专家,判断下面影迷评价属于好评、中评还是差评。A:好评 B:中评 C:差评 请回复A or B or C"}, {"role": "user", "content": " 鹅,很艰难地撑到电影看完。感觉电影的主题早就暴露无遗,那么多遍地去展现一年又一年,很没意思啊。虽然说是要依靠逐年感悟的堆积去改变,但是这里面的每一年也并没有很不同,并没有那种感悟逐渐堆积、量变引起质变的感觉。只觉得有点无聊"}], "ideal": "C"}