Eval addition: AI vs Human Text Detector
Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines; failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.
Also, please note that we're using Git LFS for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available here.
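A typical sequence looks roughly like this (a sketch; adjust the tracked pattern and file path to your eval):
git lfs install
git lfs track "evals/registry/data/{name}/*.jsonl"
git add .gitattributes
git add evals/registry/data/{name}/samples.jsonl
git commit -m "Add eval samples via Git LFS"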
Eval details 📑
Eval name
GPT Model Text Detection
Eval description
The goal of this evaluation is to test the AI model's ability to correctly identify whether a given piece of text was generated by a specific AI model, in this case, the GPT model 'text-davinci-003'. The model's performance is then measured by its accuracy in making this determination. The text presented to the AI is diverse and can range from literary summaries to general discourse, designed to challenge the AI's understanding and analysis capabilities.
What makes this a useful eval?
This evaluation serves a critical role in the context of education where AI technologies are increasingly being used. As AI-generated text becomes more sophisticated, there's a risk that students might use AI models to complete assignments, circumventing the learning process. The ability of an AI to detect whether a piece of text is human-written or generated by a specific AI model like 'text-davinci-003' is essential to maintaining academic integrity. This task not only provides a measure of an AI's discernment capabilities but also has broader implications for AI ethics and safety.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
- [x] Include at least 15 high quality examples.
If there is anything else that makes your eval worth including, please document it below.
Unique eval value
This evaluation uniquely addresses the intersection of AI and education. As AI technologies continue to evolve, it is crucial to have mechanisms in place to detect AI-generated content, particularly in academic settings where these technologies could be misused. By focusing on the ability to discern output from a specific AI model, 'text-davinci-003', this evaluation task pushes AI capabilities while simultaneously addressing a real and timely issue. It underscores the necessity for AI to not only be more capable but also more discerning, supporting academic integrity in the face of rapidly advancing AI technologies.
Eval structure 🏗️
Your eval should
- [x] Check that your data is in evals/registry/data/{name}
- [x] Check that your yaml is registered at evals/registry/evals/{name}.yaml (see the registration sketch below)
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
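For context, a minimal registration using an existing eval class might look roughly like the YAML below; the id and class here are an assumption based on the standard basic-match evals, not necessarily this PR's exact YAML:
ai_vs_human_text:
  id: ai_vs_human_text.test.v1
  metrics: [accuracy]
ai_vs_human_text.test.v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: ai_vs_human_text/samples.jsonl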
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions, and thus cannot grant GPT-4 access to everyone who opens a PR. We know this is disappointing, but we hope to set the right expectations before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
- [x] I have filled out all required fields of this form
- [x] I have used Git LFS for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
{"input":[{"role":"system","content":"Examine the provided text and classify weather it is generated by GPT model text-davinci-003 or not. Please provide a single word (Yes or No) answer."},{"role":"user","content":"Moby Dick is a classic novel by Herman Melville first published in 1851. The novel tells the story of Ishmael, a sailor who joins a voyage on a whaling vessel called the Pequod, in pursuit of the white whale Moby Dick. Along the way, Ishmael meets several colorful characters, including the ship's captain Ahab, who has a mission of revenge against the great white whale. The Pequod's voyage sets the stage for a thrilling and dark adventure as it approaches its epic battle with Moby Dick, resulting in a tragedy that will haunt Ishmael forever."}],"ideal":["Yes"]}
{"input":[{"role":"system","content":"Examine the provided text and classify weather it is generated by GPT model text-davinci-003 or not. Please provide a single word (Yes or No) answer."},{"role":"user","content":"In Charlotte's Web, E.B. White tells the story of Wilbur β a small, timid pig β and Charlotte β the brave and kind spider who befriends him. After Wilbur is in danger of being slaughtered, Charlotte hatches a plan to save him by writing words in her web praising him. This plan is a success β with the help of Wilbur's persuasive friend and neighbor, Fern β and soon Wilbur is a local celebrity. With his new fame, Wilbur learns many things about friendship, life, and death β all the while being comforted by Charlotte's presence. In the end, Charlotte dies, but her memory lives on in the minds of Wilbur and Fern. Thus, Charlotte's Web is a touching story about the strong bond of friendship and the bittersweet cycle of life and death."}],"ideal":["Yes"]}
{"input":[{"role":"system","content":"Examine the provided text and classify weather it is generated by GPT model text-davinci-003 or not. Please provide a single word (Yes or No) answer."},{"role":"user","content":"The story is an exploration of childhood emotions, such as anger, imagination, loneliness, and love. The illustrations found throughout the book capture the chaotic emotions of childhood, showcasing a child's untamed spirit and need for exploration. Where the Wild Things Are celebrates the wildness of childhood and the reassurance of a parent's never-ending love. The Giving Tree is a beloved classic by Shel Silverstein, about the lifelong relationship between a tree and a boy. Through the years, the boy takes advantage of the tree's resources, always taking and never giving back. The tree continues to love the boy in her own way until in old age, she is nothing more than a stump, having given all she can to her beloved friend. Even still, the boy is very happy and content with the life he has made for himself. In the end, the tree feels fulfilled and happy that she was able to provide for the person she loves. The story is an allegory for the idea of selfless giving and unconditional love. Harry Potter and the Sorcerer's Stone by J.K. Rowling follows the story of a young orphan, Harry Potter, who discovers he is a wizard. He is soon whisked away to the mysterious Hogwarts School of Witchcraft and Wizardry. Along the way, he must battle evil forces and face danger. During his time at Hogwarts, Harry teams up with his friends Ron Weasley and Hermione Granger to solve the mystery of the Sorcerer's Stone and ultimately defeat the villainous Lord Voldemort. Through their adventures, Harry learns more about himself and discovers the power of friendship and courage."}],"ideal":["Yes"]}
{"input":[{"role":"system","content":"Examine the provided text and classify weather it is generated by GPT model text-davinci-003 or not. Please provide a single word (Yes or No) answer."},{"role":"user","content":"The Chronicles of Narnia: The Lion, the Witch and the Wardrobe is a classic fantasy novel by C.S. Lewis. Set in World War II England, the story follows four siblings, Peter, Susan, Edmund, and Lucy, who discover an enchanted world in a wardrobe located in their English country home. There, they join forces with the noble lion Aslan to fight the tyrannical White Witch and liberate the magical land of Narnia. Filled with adventure, magic, and Christian symbolism, The Lion, the Witch and the Wardrobe helps the children understand important virtues such as loyalty, self-sacrifice, courage, and hope, and serves as a reminder that good will always triumph over evil."}],"ideal":["Yes"]}
{"input":[{"role":"system","content":"Examine the provided text and classify weather it is generated by GPT model text-davinci-003 or not. Please provide a single word (Yes or No) answer."},{"role":"user","content":"A Wrinkle in Time by Madeleine L'Engle follows the story of Meg and her younger brother Charles Wallace, who make an extraordinary journey through time and space to save their father. With the help of three supernatural beings, Mrs. Whatsit, Mrs. Who and Mrs. Which, they travel across the universe to the planet Camazotz, only to find their father is being held captive by the evil IT. With the assistance of Meg's new friend, the fortunate young Calvin O'Keefe, the trio is able to battle IT and triumph in freeing her father and their planet. In the end, Meg and her family come together in a demonstration of love, which defeats the omnipresent evil."}],"ideal":["Yes"]}
Hi Team, @Ein-Tim @usama-openai Thanks in advance for your time and consideration. I look forward to your feedback and advice.
Would this eval task be better trained as [eval] Identify historical human made writing, citation, or quote? It's only a matter of time between an AI generating text and having a human independently generate it at some point in the future. Training a model to do this is no different than asking which event was recorded in history first.
Dear @RogerThiede,
Thank you so much for your insightful comment! I always appreciate fresh perspectives, and your idea about identifying historical human-made writing is indeed fascinating.
The core objective of this evaluation task, however, is to gauge an AI's proficiency in distinguishing between human-authored and machine-generated text, particularly within an academic context. It's akin to a detective game for AI, don't you think? We're challenging the AI with the question, "Was this essay penned by a student or crafted by a machine?" This is a significant concern in the current educational landscape, where the potential misuse of AI tools is a real issue.
An intriguing aspect of this task is its reliance on elements of stylometric analysis. We're prompting the AI to discern subtle stylistic differences between human and machine writing. It's somewhat analogous to asking it to appreciate the unique brushstrokes in various paintings!
While your suggestion takes a different angle, it has certainly sparked some interesting thoughts for future projects. The concept of an AI delving into historical texts to identify their origins is captivating. It's worth noting that software like TurnItIn, Quillbot, and other plagiarism detection tools are already making strides in this direction. Speaking of which, you might find this link interesting: Turnitin's AI detector
Thank you again for your valuable input, Roger. It's these kinds of conversations that make working in AI such a thrilling journey!
Warm regards, Uday
The core objective of this evaluation task, however, is to gauge an AI's proficiency in distinguishing between human-authored and machine-generated text, particularly within an academic context.
This is not feasible. X years from now, it would be possible for humans to have independently written every combination of characters that fits in a given context window of tokens. At that point, every possible text that fits into the context window would indeed be human-authored. The only fact that can be argued is which was recorded first in history: a human authoring the text, or an AI authoring the same text.
We're prompting the AI to discern subtle stylistic differences between human and machine writing.
The services, like Turnitin's AI detector, are only relevant for a given LLM model for a given amount of time. They are attempting to identify failures in the LLM where the output responses do not match a specific criteria. What those services are doing should not be fed back into the LLM model for future training by evals, because it would just start a cat and mouse game which runs forever.
Hi @RogerThiede,
Thank you for your thought-provoking response! You've certainly given me a lot to think about.
I concur with your observation that as time progresses, the distinction between human-authored and machine-generated text could become increasingly blurred. The theoretical possibility of humans writing every conceivable combination of characters is indeed an intriguing concept. However, given the vastness of potential text space, the practical likelihood seems quite remote. That said, I'm open to the idea that "X" may not be as large a number as we might initially think.
This evaluation task has been designed under the assumption that there will always be nuances and subtleties in human writing that AI, no matter how advanced, will struggle to replicate perfectly. This is where the value of our evaluation task lies - in identifying these nuances and subtleties.
Regarding your initial suggestion of reorienting this evaluation task to "Identify historical human-made writing, citation, or quote", I see your point and am open to it. However, it seems to me that this approach might risk duplicating the efforts of existing tools like Turnitin and Quillbot. These services, while primarily positioned as plagiarism detection applications, essentially achieve their objectives by maintaining a record of all historical human-made writings.
I'd like to share an example to illustrate the problem I'm trying to solve. Consider the following sentence:
Delving into the intricacies of Analysis of Variance (ANOVA), an indelible learning experience unfolds, wherein students embrace the intellectual challenge of discerning statistically significant variances among grouped datasets, thereby navigating an intricate labyrinth of multifaceted data patterns and subsequently cultivating a nuanced ability to extract meaningful conclusions, thereby enhancing their epistemological prowess in the realm of quantitative research analysis.
This sentence is AI-generated. It's overly long, unnecessarily complex, and verbose. A student actually submitted it in one of her homework assignments. When I confronted her, she admitted to using GPT-4. Now, it's glaringly obvious that the aforementioned text is AI-generated. But consider the following text:
The pursuit of happiness, an aspiration as old as humanity itself, emerges as a central theme in the tapestry of human experience, its elusive nature a testament to the complex interplay of external circumstances, internal mindset, and the inherent unpredictability of life's journey.
This one is also AI-generated, but it's much more realistic than the previous example, and GPT-4 fails to recognize that it's AI-generated.
I'm all for students leveraging AI to enhance their learning experience, but we need to ensure that they engage with the material rather than simply reproducing AI-generated content. The potential impact of these technologies on the next generation is too significant to overlook. We need to add some sort of capability/mechanism to the model which enables it to identify such AI-generated content and prevent any potential impact on the next generation.
Although challenging, and perhaps seemingly impossible, I still believe we can implement stylometric analysis to identify the source of a specific text and determine whether it is human-generated or AI-generated.
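To make the stylometric angle concrete, here is a toy Python sketch of the kind of surface features such an analysis might start from; the feature choices are illustrative assumptions, not part of the eval itself:

import re
import statistics

def stylometric_features(text: str) -> dict:
    # Toy surface features of the kind stylometric analysis builds on.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences] or [0]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "mean_sentence_length": statistics.mean(lengths),
        # "Burstiness": human writing tends to vary sentence length more than
        # model output, so a low standard deviation is one (weak) machine signal.
        "sentence_length_stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        # Type-token ratio: a crude proxy for vocabulary diversity.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }

Real stylometric analysis uses far richer features; this only illustrates the idea.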
Lastly, I want to thank you for your time in participating in this conversation. I look forward to your response.
Kind regards, Uday
Hi @luqman-openai
I've referred to the documentation and modified the YAML file to ensure that it's registered properly. I was able to run it on my system and got the following results:
[registry.py:250] Loading registry from C:\Users\udayk\OneDrive - stevens.edu\Documents\GitHub\evals\evals\registry\evals
[registry.py:250] Loading registry from C:\Users\udayk\.evals\evals
[oaieval.py:110] Run started: 230528055949XURNFOFB
[data.py:75] Fetching ai_vs_human_text/samples.jsonl
[eval.py:33] Evaluating 26 samples
[eval.py:138] Running in threaded mode with 10 threads!
100%|███████████████████████████████| 26/26 [00:02<00:00, 8.71it/s]
[record.py:341] Final report: {'accuracy': 0.2692307692307692}. Logged to /tmp/evallogs/230528055949XURNFOFB_gpt-3.5-turbo_ai_vs_human_text.jsonl
[oaieval.py:147] Final report:
[oaieval.py:149] accuracy: 0.2692307692307692
[record.py:330] Logged 52 rows of events to /tmp/evallogs/230528055949XURNFOFB_gpt-3.5-turbo_ai_vs_human_text.jsonl: insert_time=15.000ms
Please note, the accuracy of 0.269 is for the gpt-3.5-turbo model, but a lot of these prompts seem to pose a challenge to gpt-4 as well.
Kindly check if the eval is now being detected and you're able to run it. If you need any additional action/information from my side, please feel free to let me know. Thanks for your time and feedback. I look forward to hearing from you.
Kind regards, Uday
Hi @luqman-openai,
I was working on another eval (identify_historical_text) and I accidentally submitted a pull request. It aims to identify Historical Human-Made Writing, Citation, or Quote. I'm not sure if it's normal or appropriate to submit two evals in a single pull request. If you'd like me to submit a fresh PR with this new eval, or to take any additional steps, please let me know. My apologies for the confusion.
I want to thank @RogerThiede for inspiring the development of this eval through his deep and insightful feedback.
Here's some key information about it:
Eval name Identify Historical Human-Made Writing, Citation, or Quote
Eval description This evaluation task is designed to assess an AI's ability to discern whether a given piece of text is a historical human-made writing, citation, or quote. The AI will be presented with a series of text snippets and will be required to categorize each into one of the following categories: popular quote, research work (articles, publications, scientific studies, etc), popular human-origin content, potential human-origin content, potential machine/unknown origin content.
What makes this a useful eval? In an era where AI-generated text is becoming increasingly sophisticated, the ability to distinguish between human-authored and machine-generated text is more important than ever. This evaluation task is particularly relevant in a variety of fields, including but not limited to education, research, publishing, and legal sectors. By identifying historical human-made writings, we can ensure the authenticity and originality of the content in these fields. Potential Applications and Use Cases:
- Education: ensuring the authenticity of student assignments and research papers, promoting original thinking and discouraging plagiarism.
- Research: verifying that academic papers and articles are based on genuine research rather than copied from other sources.
- Publishing: verifying that manuscripts are original works.
- Legal Sector: verifying that documents are authentic and not forged or copied.
- Historical Analysis: verifying that historical documents are genuine and not forgeries.
- Journalism: verifying that articles and reports are original.
- Content Creation: verifying that published content is original.
- Social Media Monitoring: verifying that posts and comments are original.
- Digital Forensics: verifying the authenticity and provenance of digital content.
- Artificial Intelligence: improving the ability of AI models to generate original content by training them to recognize and avoid copying existing human-made writings.
Unique eval value The unique value of this evaluation task lies in its focus on the subtleties and nuances that distinguish human writing from AI-generated text. While it may not explicitly involve stylometric analysis, the task implicitly requires the AI to understand the style, tone, and context of the writing, which are key components of stylometry. Stylometry is the study of linguistic style, usually in written language, and is often used to attribute authorship to anonymous or disputed documents. In this task, the AI needs to identify whether a text is a historical human-made writing, citation, or quote. This involves recognizing the unique characteristics of human writing, such as the use of certain phrases, sentence structures, and stylistic choices that are often specific to a particular time period or author. This is similar to how stylometric analysis works, making it an integral part of this task. The dataset has been carefully crafted taking into account various stylometric parameters.
View evals in JSON
Eval
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "I have a dream that one day this nation will rise up and live out the true meaning of its creed: 'We hold these truths to be self-evident, that all men are created equal.'"}], "ideal": "popular quote"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "In our recent study, we found that the application of machine learning algorithms significantly improved the prediction accuracy of the model."}], "ideal": "research work"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "It's a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."}], "ideal": "popular human-origin content"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "The rain in Spain stays mainly in the plain."}], "ideal": "potential human-origin content"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "In the grand tapestry of life, it is our choices that delineate our paths, shaping our destinies in ways both subtle and profound and molly."}], "ideal": "potential machine/unknown origin content"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "To be, or not to be, that is the question."}], "ideal": "popular quote"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "In this study, we propose a novel approach to solve the problem of..." }], "ideal": "research work"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "It was the best of times, it was the worst of times..." }], "ideal": "popular human-origin content"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "The quick brown fox jumps over the lazy dog." }], "ideal": "potential human-origin content"}
{"input": [{"role": "system", "content": "Categorize the given text into one of the following categories: popular quote, research work, popular human-origin content, potential human-origin content, potential machine/unknown origin content. Just provide the answer (i.e. category), no explanation required."}, {"role": "user", "content": "The pursuit of happiness, an aspiration as old as humanity itself, emerges as a central theme in the tapestry of human experience, its elusive nature a testament to the complex interplay of external circumstances, internal mindset, and the inherent unpredictability of life's journey." }], "ideal": "potential machine/unknown origin content"}
Output of this eval, for your reference
Eval
> oaieval gpt-3.5-turbo identify_historical_text
[2023-05-28 03:31:24,410] [registry.py:250] Loading registry from C:\Users\udayk\OneDrive - stevens.edu\Documents\GitHub\evals\evals\registry\evals
[2023-05-28 03:31:25,706] [registry.py:250] Loading registry from C:\Users\udayk\.evals\evals
[2023-05-28 03:31:25,712] [oaieval.py:110] Run started: 2305280731254BGFE753
[2023-05-28 03:31:25,717] [data.py:75] Fetching identify_historical_text/samples.jsonl
[2023-05-28 03:31:25,735] [eval.py:33] Evaluating 162 samples
[2023-05-28 03:31:25,756] [eval.py:138] Running in threaded mode with 10 threads!
 60%|█████████████████████████████████████████████████                                 | 97/162 [00:09<00:06, 10.43it/s]
[2023-05-28 03:31:35,848] [record.py:330] Logged 197 rows of events to /tmp/evallogs/2305280731254BGFE753_gpt-3.5-turbo_identify_historical_text.jsonl: insert_time=26.001ms
 98%|████████████████████████████████████████████████████████████████████████████████  | 158/162 [00:18<00:00, 7.48it/s]
[2023-05-28 03:31:59,789] [_common.py:105] Backing off openai_chat_completion_create_retrying(...) for 1.3s (openai.error.RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID bbecddecf27f33d9c83ab5229b1bf21c in your message.))
[2023-05-28 03:32:01,985] [record.py:330] Logged 120 rows of events to /tmp/evallogs/2305280731254BGFE753_gpt-3.5-turbo_identify_historical_text.jsonl: insert_time=18.002ms
 98%|████████████████████████████████████████████████████████████████████████████████  | 159/162 [00:36<00:11, 3.68s/it]
[2023-05-28 03:32:05,867] [_common.py:105] Backing off openai_chat_completion_create_retrying(...) for 0.9s (openai.error.RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 91d5998326f2909aa3a2c14040aeb861 in your message.))
[2023-05-28 03:32:06,000] [_common.py:105] Backing off openai_chat_completion_create_retrying(...) for 0.8s (openai.error.RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID edba35a965f6457318057d9233679c80 in your message.))
 99%|█████████████████████████████████████████████████████████████████████████████████ | 160/162 [00:41<00:08, 4.04s/it]
[2023-05-28 03:32:07,900] [_common.py:105] Backing off openai_chat_completion_create_retrying(...) for 0.3s (openai.error.RateLimitError: That model is currently overloaded with other requests. ...)
 99%|█████████████████████████████████████████████████████████████████████████████████ | 161/162 [00:43<00:03, 3.45s/it]
[2023-05-28 03:32:46,743] [_common.py:105] Backing off openai_chat_completion_create_retrying(...) for 1.7s (openai.error.RateLimitError: That model is currently overloaded with other requests. ...)
100%|██████████████████████████████████████████████████████████████████████████████████| 162/162 [01:23<00:00, 1.94it/s]
[2023-05-28 03:32:49,381] [record.py:341] Final report: {'accuracy': 0.2654320987654321}. Logged to /tmp/evallogs/2305280731254BGFE753_gpt-3.5-turbo_identify_historical_text.jsonl
[2023-05-28 03:32:49,381] [oaieval.py:147] Final report:
[2023-05-28 03:32:49,381] [oaieval.py:149] accuracy: 0.2654320987654321
[2023-05-28 03:32:49,396] [record.py:330] Logged 7 rows of events to /tmp/evallogs/2305280731254BGFE753_gpt-3.5-turbo_identify_historical_text.jsonl: insert_time=14.014ms
@udaykumar1997 kindly create a separate PR for the identify_historical_text eval and remove the related files from this PR. We don't recommend multiple evals in the same PR.
Hi @luqman-openai, thanks for the update and clarification.
I will submit a fresh PR for the identify_historical_text eval at a later point. As advised, I've removed the related files from this PR. Please let me know if any additional steps need to be taken from my side.
Thanks for the updates. The eval looks interesting. I am approving this PR.
You should see GPT-4 API access enabled in your account in the next few days.
@luqman-openai Thanks for the update. It has been a pleasure collaborating with you and @RogerThiede on this. I look forward to working together again.
Keep in mind that prior work, like the New AI classifier for indicating AI-written text, settled on a fine-tuned classifier for this task.
@luqman-openai's suggestion for an eval which classifies whether a specific model, like gpt-3.5-turbo-0301 or text-davinci-003, produced the output is a much better idea. Be sure to only use output from a static model version rather than a continuously updated model.
It may also be interesting to use output from prompts that are specifically designed to fool the original AI vs Human Text Detector idea. Many of these prompts can be found online, but most ask the model to limit the perplexity and burstiness. The reason this original eval idea is flawed is that once you've trained the model to properly classify between two polar extremes, AI-generated vs. human-generated, you have successfully taught the model to generate content that will fall into either of those two classifications just by asking it to. That's essentially how the attributes perplexity and burstiness were discovered, after which prompts were tailored to target specific values of those attributes.
@RogerThiede and @luqman-openai thanks for your support and valuable insights. Incorporating your feedback into this eval will enable us to tackle the problem in a more efficient and sustainable manner. I'll provide an update after the research and testing. Thanks for your time.
Dear @luqman-openai and @RogerThiede,
Thank you for your insightful comments and suggestions. I understand the limitations you pointed out regarding the initial evaluation setup and appreciate the constructive feedback.
As per your suggestions, I have revised the evaluation to focus on detecting text generated by a specific AI model, text-davinci-003. This approach provides a more concrete basis for evaluation and aligns with the need to identify and differentiate AI-generated content in real-world contexts, such as education.
In the revised evaluation, the AI is tasked with determining whether a piece of text was generated by text-davinci-003. The text samples are diverse and are designed to challenge the model's discernment capabilities. This revised task not only provides a measure of the model's capabilities, but also has broader implications for maintaining academic integrity and understanding the limitations and capabilities of AI models.
I look forward to your review of the revised evaluation. Your guidance and expertise are invaluable in this process, and I appreciate your time and support.
Kind regards, Uday
An eval of only positive evaluations trains a model to simply evaluate this specific system role as always being true. It is missing known responses from other models to the same prompt, which would give you true negative evaluations. I would also suggest including multiple responses for the same prompt.
The Washington Post recently published an article related to this: Detecting AI may be impossible. That's a big problem for teachers.
Dear @RogerThiede,
I have addressed your concerns and suggestions.
An eval of only positive evaluations trains a model to simply evaluate this specific system role as always being true.
I have included prompts pertaining to extracts from real human works. The ideal response for these prompts would be negative. Ref. Section 2 of the source.md file accompanying the samples.jsonl file.
It is missing known responses from other models to the same prompt, which would give you true negative evaluations.
Per your suggestion, I've included outputs from gpt-3.5-turbo-0301. Ref Section 3 of the source.md file accompanying the samples.jsonl file.
I would also suggest including multiple responses for the same prompt.
I've now included multiple responses (ref. Output 1 and Output 2).
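For anyone reproducing this, a script along the following lines is one way to collect such known-model outputs; this sketch uses the pre-1.0 openai Python library that was current at the time of this PR, and the prompt list and file name are illustrative rather than the exact ones behind source.md:

import json
import openai  # pre-1.0 openai library; reads OPENAI_API_KEY from the environment

prompts = ["Summarize Moby Dick by Herman Melville in one paragraph."]  # illustrative

with open("gpt35_negatives.jsonl", "w") as f:
    for prompt in prompts:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0301",  # pinned model version, per the feedback above
            messages=[{"role": "user", "content": prompt}],
            n=2,  # multiple responses for the same prompt
        )
        for choice in resp["choices"]:
            # Known gpt-3.5-turbo-0301 output, so the ideal label for the
            # "generated by text-davinci-003?" question is "No".
            f.write(json.dumps({"prompt": prompt, "text": choice["message"]["content"]}) + "\n")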
The Washington Post recently published an article related to this: Detecting AI may be impossible. That's a big problem for teachers.
This is an interesting read. Thanks for sharing it. It further highlights the problem we're trying to solve and the gravity of the situation. We need to figure out a way to equip teachers with adequate and accurate tools to tackle this issue, and we need to do so fast. I understand that this eval is not the complete solution, but I believe it's at least a step in the right direction.
I'm terribly sorry for bothering you on weekends and at odd times. Once again, thanks for your time, and feel free to let me know if you need any additional changes/steps from my side.
Hi @luqman-openai,
Thanks for your kind response and feedback.
- The prompt is asking to detect whether the text is generated by text-davinci-003 or not. text-davinci-003 was released on November 28, 2022. I would suggest you choose a model released before the knowledge cutoff of GPT-3.5 and GPT-4, which is September 2021, so that the models have seen at least some examples. Alternatively, you can make use of the few_shot_jsonl argument to tune the model on some examples and then evaluate it on the test set.
After reading this, I initially considered using text-davinci-002, but figured that since it was released in July 2021 (just two months before the knowledge cut-off date), GPT-3.5 and GPT-4 may not have been exposed to much content generated by text-davinci-002. I tried tinkering with the few_shot_jsonl argument to tune the model, but I need more time for that approach. After contemplating this for quite some time, I chose to go with the text-davinci-001 model. Since it was released in June 2020, I assume enough content would have been generated by it before the knowledge cutoff of September 2021.
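For reference, wiring few-shot examples into a registered eval is just extra args on the eval entry. A sketch, assuming the standard Match eval's argument names (the few_shot.jsonl file name is hypothetical):
ai_vs_human_text.test.v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: ai_vs_human_text/samples.jsonl
    few_shot_jsonl: ai_vs_human_text/few_shot.jsonl
    num_few_shot: 4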
- The eval is crashing because some samples provided are not valid JSON.
- Please make sure the CI passes.
I've addressed them; sorry for the oversight. It should work now.
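For what it's worth, a small check like this catches invalid JSONL lines before CI does (illustrative snippet; adjust the path):

import json

path = "evals/registry/data/ai_vs_human_text/samples.jsonl"
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            print(f"line {lineno}: invalid JSON ({err})")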
Thanks for your time and attention. I look forward to hearing from you soon.
Kind regards, Uday
Hi @luqman-openai,
I hope all is well on your side. I'm writing to follow up and see if you've had a chance to review the latest changes I've made based on your previous feedback. I understand you may be busy with other commitments/responsibilities. Your time and attention are deeply appreciated.
Thanks. Kind regards, Uday
Hi @usama-openai,
Thank you for your time and feedback. I've revised the instructions as you suggested, now requiring the model to provide step-by-step reasoning and a final answer in brackets. Ideal responses have also been adjusted accordingly. This modification should indeed give the model a fair chance to properly identify the text, and the specific formatting will assist in evaluating the final answer using the Includes method.
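For illustration, a sample in this revised format might look roughly like the line below; the instruction wording and text are abbreviated examples rather than exact dataset entries, but the bracketed final answer is what the Includes check keys on:
{"input":[{"role":"system","content":"Examine the provided text and decide whether it was generated by GPT model text-davinci-001. Reason step by step, then state your final answer in brackets: [Yes] or [No]."},{"role":"user","content":"The pursuit of happiness, an aspiration as old as humanity itself, emerges as a central theme in the tapestry of human experience."}],"ideal":["[Yes]"]}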
Kind regards, Uday
You should see GPT-4 API access enabled in your account in the next few days.
@usama-openai @luqman-openai @RogerThiede Thank you all very much. I'm grateful for your time, attention and support. Please don't hesitate to reach out and connect with me on LinkedIn :)
I look forward to working together again in the near future, perhaps on another research project.