
Evaluate GPT-4 on classical NLP tasks

Open LifeIsStrange opened this issue 1 year ago • 3 comments

Addressing the elephant in the room

When transformers were first introduced, their revolutionary accuracy results were mostly demonstrated on standard NLP tasks such as POS tagging, dependency parsing, coreference resolution, word sense disambiguation (WSD), etc. But I have observed that, since PaLM and other very large language models, the published benchmark results target much higher-level tasks, such as common-sense reasoning tests and question answering. Both sets of benchmarks are useful and needed, but I would like to highlight that the standard NLP tasks are now almost entirely un-benchmarked on these newer language models, and that this impairs progress towards AGI and industrial uses.

Even if one could argue that purely symbolic AI progress has stalled for decades, there is huge potential in neuro-symbolic hybrid systems that use neural networks for low-level analysis tasks (POS tagging, etc.) and feed that linguistic data to higher-level neural networks or to symbolic systems, in order to push the boundaries of what is possible, especially regarding semantic analysis, i.e. true NLU systems.
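To make the hybrid idea concrete, here is a toy sketch (my own illustration, not from this issue or the evals repo): a neural tagger's POS output feeds a purely symbolic rule that extracts simple noun phrases. The function names and the chunking rule are hypothetical; a real system would use a grammar rather than a flat tag whitelist.

```python
def extract_noun_phrases(tokens, tags):
    """Symbolic rule: a maximal run of determiner/adjective/noun tags is an NP.

    `tokens` and `tags` are parallel lists, e.g. the output of a neural
    POS tagger using Penn Treebank tags.
    """
    np_tags = {"DT", "JJ", "NN", "NNS", "NNP"}
    phrases, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag in np_tags:
            current.append(tok)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:  # flush a phrase that ends the sentence
        phrases.append(" ".join(current))
    return phrases

tokens = ["The", "quick", "fox", "jumps", "over", "the", "lazy", "dog"]
tags   = ["DT", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"]
print(extract_noun_phrases(tokens, tags))  # ['The quick fox', 'the lazy dog']
```

The point of the sketch: once a model reliably produces the low-level annotations, arbitrarily sophisticated downstream logic (symbolic or neural) can consume them.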

### Foundational NLP tasks of interest:
- [ ] [Dependency parsing](https://paperswithcode.com/sota/dependency-parsing-on-penn-treebank)
- [ ] [word sense disambiguation](https://paperswithcode.com/sota/word-sense-disambiguation-on-supervised)
- [ ] [Coreference resolution](https://paperswithcode.com/sota/coreference-resolution-on-ontonotes)
- [ ] [POS tagging](https://paperswithcode.com/sota/part-of-speech-tagging-on-penn-treebank)
- [ ] others

Therefore this issue is a call for contributions to implement evals on those standard tasks, especially dependency parsing. I believe GPT-4 has the potential to improve the SOTA on at least some foundational NLP tasks, and an even greater potential once someone fine-tunes it and combines it with domain-specific optimizations (as is currently done with BERT-based SOTAs, such as HPSG for dependency parsing).
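As a starting point, such an eval boils down to two pieces: a prompt template and a metric. The sketch below shows a minimal, hypothetical version for POS tagging: `build_prompt` and `pos_accuracy` are illustrative names of my own, and the prompt format is an assumption, not the evals framework's actual sample schema.

```python
def build_prompt(tokens):
    """Build an illustrative prompt asking a chat model for PTB POS tags."""
    return (
        "Tag each token with its Penn Treebank POS tag. "
        "Answer with space-separated tags, one per token.\n"
        "Tokens: " + " ".join(tokens)
    )

def pos_accuracy(predicted_tags, gold_tags):
    """Token-level accuracy; extra or missing predictions count as wrong."""
    if not gold_tags:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)

# Gold annotation for "The dog barks ." vs. a model reply split on whitespace.
gold = ["DT", "NN", "VBZ", "."]
pred = ["DT", "NN", "VB", "."]
print(build_prompt(["The", "dog", "barks", "."]))
print(pos_accuracy(pred, gold))  # 0.75
```

A real contribution would wrap this in the evals repo's YAML/JSONL conventions and report accuracy over the full test set, but the scoring logic stays this simple.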

LifeIsStrange avatar Mar 16 '23 14:03 LifeIsStrange

Great idea!

andrew-openai avatar Apr 22 '23 15:04 andrew-openai

I know this thread is from a while back but curious if anyone has managed to do this?

sudarshansivakumar avatar Dec 14 '23 17:12 sudarshansivakumar

I tried to do this, still working on it! how about you?


Mukhsin0508 avatar Dec 14 '23 18:12 Mukhsin0508