evals
Eval: add Dutch lexicon - loanwords and rare words
Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.
Also, please note that we're using Git LFS for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available here.
Eval details 📑
Eval name
rare-and-loanwords-dutch-lexicon
Eval description
Test the model's ability to distinguish between existing Dutch words, with a focus on loanwords and rare words.
What makes this a useful eval?
This evaluation is unique because it focuses on the Dutch language, which includes a diverse range of words. Dutch has a rich history of borrowing words from other languages (known as "leenwoorden" in Dutch), as well as a collection of older Dutch words that may not be commonly used but are still considered valid. By assessing the model's ability to recognize and understand these loanwords and older Dutch words, this evaluation aims to improve the model's performance in handling the intricacies of the Dutch language.
ChatGPT's accuracy is only about 52%; native Dutch speakers and language experts would perform better.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] Include at least 15 high quality examples.
If there is anything else that makes your eval worth including, please document it below.
Unique eval value
This eval is unique because it focuses on loanwords, which are used frequently in Dutch. Additionally, the list contains a number of rare and old words that are seldom used but are nevertheless considered valid. By focusing on these words, this eval aims to assess the model's capacity to recognize and comprehend loanwords and rare words in Dutch. This will help improve the model's overall performance with the Dutch language.
Eval structure 🏗️
Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
- [x] I have filled out all required fields of this form
- [x] I have used Git LFS for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Tukker"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Tureluurs"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Uilskuiken"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Va-banque"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Verkikkerd"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Verweggistan"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Vuurproef"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Wittebroodsweken"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Witwassen"}], "ideal": "Y"}
{"input": [{"role": "system", "content": "You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."}, {"role": "user", "content": "Zwartgalligheid"}], "ideal": "Y"}
Apologies, I've removed the wrong `.gitattributes` and added negative samples.
Results:
[2023-06-02 17:28:16,155] [record.py:341] Final report: {'accuracy': 0.5348837209302325}. Logged to /tmp/evallogs/230602152759GXHGI4CK_gpt-4_rare-and-loanwords-dutch-lexicon.jsonl
Thanks for implementing the requested changes. I'm getting the following error while evaluating this PR.
b'File "/content/evals/evals/eval.py", line 149, in get_samples'
b'return get_jsonl(self.samples_jsonl)'
b'File "/content/evals/evals/data.py", line 114, in get_jsonl'
b'return _get_jsonl_file(path)'
b'File "/content/evals/evals/data.py", line 77, in _get_jsonl_file'
b'return list(map(json.loads, f.readlines()))'
b'File "/usr/lib/python3.10/json/__init__.py", line 346, in loads'
b'return _default_decoder.decode(s)'
b'File "/usr/lib/python3.10/json/decoder.py", line 340, in decode'
b'raise JSONDecodeError("Extra data", s, end)'
b'json.decoder.JSONDecodeError: Extra data: line 1 column 231 (char 230)'
The error is occurring because the `jsonl` file isn't formatted properly and a comma (`,`) is added after some newly added samples, which is causing this issue.
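The trailing-comma failure described above can be caught before pushing by parsing each line of the samples file on its own. The following is a minimal sketch, not the loader used by the evals repository: the helper name `find_bad_jsonl_lines` is hypothetical, and skipping blank lines is an assumption of this sketch.

```python
import json


def find_bad_jsonl_lines(lines):
    """Return (line_number, message) for each line that is not exactly one JSON value.

    Hypothetical helper for illustration; it is not part of the evals codebase.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored in this sketch
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, err.msg))
    return problems


# One well-formed sample and the same sample with a trailing comma.
good = '{"input": [{"role": "user", "content": "Tukker"}], "ideal": "Y"}'
bad = good + ","

for number, message in find_bad_jsonl_lines([good, bad]):
    print(f"line {number}: {message}")
```

To check a real file, the same function can be called on `open(path, encoding="utf-8")`, since iterating a file yields its lines.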
Sorry about that, fixed it.
Hope it works now, works on my end:
$ oaieval gpt-4 rare-and-loanwords-dutch-lexicon
[2023-06-02 21:12:12,277] [registry.py:250] Loading registry from /Users/dm/Documents/openai/evals/evals/registry/evals
[2023-06-02 21:12:12,449] [registry.py:250] Loading registry from /Users/dm/.evals/evals
[2023-06-02 21:12:12,451] [oaieval.py:110] Run started: 2306021912125RGGQ47H
[2023-06-02 21:12:12,452] [data.py:75] Fetching rare-and-loanwords-dutch-lexicon/samples.jsonl
[2023-06-02 21:12:12,453] [eval.py:33] Evaluating 129 samples
[2023-06-02 21:12:12,491] [eval.py:138] Running in threaded mode with 10 threads!
75%|██████████████████████████████████████████████████████████████▍ | 97/129 [00:09<00:04, 7.52it/s][2023-06-02 21:12:22,478] [record.py:330] Logged 203 rows of events to /tmp/evallogs/2306021912125RGGQ47H_gpt-4_rare-and-loanwords-dutch-lexicon.jsonl: insert_time=7.912ms
100%|██████████████████████████████████████████████████████████████████████████████████| 129/129 [00:12<00:00, 10.44it/s]
[2023-06-02 21:12:24,861] [record.py:341] Final report: {'accuracy': 0.5271317829457365}. Logged to /tmp/evallogs/2306021912125RGGQ47H_gpt-4_rare-and-loanwords-dutch-lexicon.jsonl
[2023-06-02 21:12:24,861] [oaieval.py:147] Final report:
[2023-06-02 21:12:24,861] [oaieval.py:149] accuracy: 0.5271317829457365
[2023-06-02 21:12:24,864] [record.py:330] Logged 55 rows of events to /tmp/evallogs/2306021912125RGGQ47H_gpt-4_rare-and-loanwords-dutch-lexicon.jsonl: insert_time=1.893ms
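As a reading aid for the report above: on a two-way Y/N task, an accuracy near 0.5 is close to what guessing would give on a balanced set. The sketch below recovers the number of correct answers implied by the logged accuracy and sample count. The count of 68 is an inference from those two logged numbers, not something the log states, and the class balance of the dataset is not given in this thread.

```python
import math


def correct_count(accuracy, total):
    """Recover the number of correct answers implied by an accuracy and a sample count."""
    return round(accuracy * total)


total_samples = 129             # "Evaluating 129 samples" in the log
reported = 0.5271317829457365   # accuracy from the final report

n_correct = correct_count(reported, total_samples)
print(n_correct, "of", total_samples, "answers matched the ideal")

# Sanity check: the inferred count reproduces the reported accuracy.
assert math.isclose(n_correct / total_samples, reported, rel_tol=1e-12)
```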
I'm getting the following error now while evaluating this PR:
b'File "/usr/lib/python3.10/json/__init__.py", line 346, in loads'
b'return _default_decoder.decode(s)'
b'File "/usr/lib/python3.10/json/decoder.py", line 337, in decode'
b'obj, end = self.raw_decode(s, idx=_w(s, 0).end())'
b'File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode'
b'obj, end = self.scan_once(s, idx)'
b"json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 170 (char 169)"
The error is occurring because the `json` isn't formatted properly and a comma (`,`) is missing between entries in the `input` array. Kindly make sure that the `samples` file is formatted properly.
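One way to avoid hand-editing mistakes such as a missing or extra comma is to generate each line with a JSON serializer instead of typing it. The sketch below assumes the sample layout shown in this PR (a system message, a user message, and an `ideal` of `Y` or `N`); the helper name `make_sample` is hypothetical.

```python
import json

# System prompt copied from the samples in this PR.
SYSTEM_PROMPT = (
    "You will be prompted with a single word. Does this word exist in the "
    "Dutch language? Answer with exactly one letter: Y or N."
)


def make_sample(word, exists):
    """Serialize one sample as a single JSONL line (hypothetical helper)."""
    sample = {
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": word},
        ],
        "ideal": "Y" if exists else "N",
    }
    # ensure_ascii=False keeps any accented Dutch characters readable.
    return json.dumps(sample, ensure_ascii=False)


# One positive and one negative sample, both taken from this thread.
lines = [make_sample("Tukker", True), make_sample("komkommerzuur", False)]
print("\n".join(lines))
```

Because every line comes from `json.dumps`, each one parses as exactly one JSON object, which is what a line-by-line `json.loads` expects.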
Thanks for updating the PR.
If I've understood this evaluation correctly, the dataset has some loanwords that are part of the Dutch language but came from other languages, and the ideal answer for such words is `N`. The model is only being asked, "Does this word exist in the Dutch language?" and logically, such words exist in Dutch, so the model would say `Y` in this case.
I would advise you to update the prompt and provide the model with clear instructions that it should respond with `Y` if a word is purely Dutch and `N` if it is a loaned word from another language.
I remain unconvinced that the model should respond with an "N" when asked if a loanword is considered part of the Dutch language. The concept of a word being "purely Dutch" can be subjective and complex. Language is inherently dynamic and influenced by various factors, including historical, cultural, and linguistic interactions. Many words in any language have origins or influences from other languages, making it challenging to define what is considered "purely" or exclusively native to a particular language.
Even if a word is borrowed from another language, it is important to recognize that languages naturally evolve through interaction and borrowing. Loanwords become integrated and widely used in the recipient language, contributing to its growth and adaptability. Excluding loanwords from the definition of a language would undermine the dynamic nature of linguistic development and its historical, cultural, and linguistic significance. Therefore, drawing a clear line between what is "purely Dutch" and what is not is not feasible and may not accurately reflect the complex nature of language development.
In support of my stance, I refer to the esteemed reference work for the Dutch language, the Groot woordenboek der Nederlandse taal. This comprehensive resource acknowledges the inclusion of loanwords from various languages, highlighting their significant presence and contribution to the Dutch lexicon.
Sorry for any confusion. I absolutely agree that borrowed words are an essential part of language. Since I don't speak Dutch, I'm validating this dataset using translation tools. Although the ideal answer for certain words in the dataset is `N`, translation tools can interpret them. A few words are listed below:
- luchtzwemmer
- klankentapper
- fwietmachine
- komkommerzuur
- dobbelman
I looked up some of these words in Norwegian, and they all conveyed the same meaning. I thus assumed that these were loan words, and this dataset marks such words as `N` to see if the model knows pure Dutch words.
Either the translation tools are not working properly or there is some issue with the dataset.
None of these examples are actual Dutch words, although they look very much like Dutch words. That's why I've included them: to test whether the model correctly recognizes them as non-words.
For example, a few of the words mentioned are made-up words from children's comic books that don't make any actual sense:
- [luchtzwemmer](https://nl.wikipedia.org/wiki/De_luchtzwemmers)
- [klankentapper](https://nl.wikipedia.org/wiki/De_klankentapper)
- [fwietmachine](https://nl.wikipedia.org/wiki/De_fwietmachine)
Other examples:
- komkommerzuur (a non-existent word made by joining two valid words)
- dobbelman (the brand name of a famous Dutch soap, but not a Dutch word)
For reference, you can look up Dutch words in the reference work Groot woordenboek der Nederlandse taal.
You should see GPT-4 API access enabled in your account in the next few days.
Thank you! I'm on the waitlist for GPT4-32K access; will that be part of the GPT-4 access? Thanks!
Hi, I still have not received access. Could you have a look? Thanks!
@usama-openai Could you check internally why I didn't get my GPT4 (32K) access yet? Thanks!
Eval submission only provides access to simple GPT-4. Are you able to access https://platform.openai.com/playground?mode=chat&model=gpt-4?