[BENCHMARK DATASET REQUEST] Danish Similarity Outlier Detection
Dataset name
Danish Similarity Outlier Detection
Dataset link
https://github.com/kuhumcst/danish-semantic-reasoning-benchmark/tree/main/similarity
Dataset languages
- [x] Danish
- [ ] Dutch
- [ ] English
- [ ] Faroese
- [ ] French
- [ ] German
- [ ] Icelandic
- [ ] Italian
- [ ] Norwegian (Bokmål or Nynorsk)
- [ ] Spanish
- [ ] Swedish
Describe the dataset
The dataset measures the ability to find the outlier in a list of words, i.e. the word that is least similar to the rest. Some examples:
- ['droge', 'bregnerod', 'medicinbrug', 'kinabark', 'lægeurt', 'salvie']
- ['kontantautomat', 'pengeautomat', 'dankortautomat', 'hæveautomat', 'bankomat', 'skranke']
- ['nationalitet', 'rige', 'nation', 'stat', 'land', 'enkeltstat']
The data is extracted from The Danish Thesaurus as part of the Danish Semantic Reasoning Benchmark.
The dataset tests the ability to distinguish word senses / meanings in Danish at different granularities across a broad selection of the vocabulary. Solving the task therefore requires a combination of Danish skills and word knowledge.
There are three granularities (coarse, medium, fine), but I would start with medium or coarse.
Looks good! We could formulate it as a multiple-choice task with 6 choices, in which case it fits in with the existing tasks. This would fit in the knowledge category I'd reckon.
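A rough sketch of what the multiple-choice formulation could look like, in case it helps the discussion. Everything here is an assumption for illustration: the function name `to_multiple_choice`, the a-f labels, the English question text, and the way the gold outlier is passed in (the actual file format of the dataset is not described in this thread). The outlier chosen in the example call is likewise only a guess, not taken from the dataset's gold labels.

```python
import random

LABELS = "abcdef"  # assumed labelling scheme for the 6 choices


def to_multiple_choice(words, outlier, seed=0):
    """Turn one word list into a 6-way multiple-choice item (hypothetical format)."""
    if len(words) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} words, got {len(words)}")
    if outlier not in words:
        raise ValueError("outlier must be one of the words")

    # Shuffle so the position of the outlier carries no signal.
    options = list(words)
    random.Random(seed).shuffle(options)

    # Placeholder question text; the real prompt would presumably be in Danish.
    lines = ["Which word is least similar to the others?"]
    lines += [f"{label}. {word}" for label, word in zip(LABELS, options)]

    return {"text": "\n".join(lines), "label": LABELS[options.index(outlier)]}


# Word list copied from the issue; 'skranke' as the outlier is an assumption.
example = to_multiple_choice(
    ["kontantautomat", "pengeautomat", "dankortautomat",
     "hæveautomat", "bankomat", "skranke"],
    outlier="skranke",
)
print(example["text"])
print("gold:", example["label"])
```

Shuffling with a fixed seed keeps the items reproducible while making sure the gold label is not always in the same position.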
Great! Do you need anything more from us?
Hi @Linguistcoder
I just had a look at your dataset, but I cannot open https://github.com/kuhumcst/danish-semantic-reasoning-benchmark/blob/main/similarity/similarity.zip as I don't have the required password. Would you be able to share the password or consider making this file publicly available?