ScandEval [BENCHMARK DATASET REQUEST] Danish Similarity Outlier Detection

Dataset name

Danish Similarity Outlier Detection

Dataset link

https://github.com/kuhumcst/danish-semantic-reasoning-benchmark/tree/main/similarity

Dataset languages

[x] Danish
[ ] Dutch
[ ] English
[ ] Faroese
[ ] French
[ ] German
[ ] Icelandic
[ ] Italian
[ ] Norwegian (Bokmål or Nynorsk)
[ ] Spanish
[ ] Swedish

Describe the dataset

The dataset measures the ability to find the outlier among a list of words (which word is the least similar to the rest). Some examples:

['droge', 'bregnerod', 'medicinbrug', 'kinabark', 'lægeurt', 'salvie']
['kontantautomat', 'pengeautomat', 'dankortautomat', 'hæveautomat', 'bankomat', 'skranke']
['nationalitet', 'rige', 'nation', 'stat', 'land', 'enkeltstat']

The data is extracted from The Danish Thesaurus as part of the Danish Semantic Reasoning Benchmark.

The dataset tests the ability to distinguish word senses / meanings in Danish at different granularities across a broad selection of the vocabulary, so solving the task requires a combination of Danish skills and word knowledge.

There are three granularities (coarse, medium, fine), but I would start with medium or coarse.

Apr 02 '25 13:04 Linguistcoder

Looks good! We could formulate it as a multiple-choice task with 6 choices, in which case it fits in with the existing tasks. This would fit in the knowledge category, I'd reckon.
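
To make the multiple-choice idea concrete, here is a rough sketch of how one word list could be turned into a 6-choice record. The function name `to_multiple_choice`, the question wording, and the `text` / `label` field names are all assumptions for illustration, not the actual ScandEval format. The dataset's own column layout is also unknown here, so the outlier is simply passed in as an argument.

```python
from string import ascii_lowercase


def to_multiple_choice(
    words: list[str],
    outlier: str,
    question: str = "Which word is the least similar to the rest?",
) -> dict[str, str]:
    """Build one multiple-choice record from a word list and its outlier.

    The question text and the output keys are hypothetical placeholders.
    """
    if outlier not in words:
        raise ValueError(f"{outlier!r} is not one of the candidate words")
    if len(words) > len(ascii_lowercase):
        raise ValueError("too many candidate words to label with a-z")

    # Label the candidates a, b, c, ... in their original order.
    labels = ascii_lowercase[: len(words)]
    options = "\n".join(f"{label}. {word}" for label, word in zip(labels, words))

    return {
        "text": f"{question}\nChoices:\n{options}",
        "label": labels[words.index(outlier)],
    }


# Usage with one of the word lists from the issue. Which word is the
# outlier is an assumption made for this example; the issue does not say.
record = to_multiple_choice(
    ["kontantautomat", "pengeautomat", "dankortautomat",
     "hæveautomat", "bankomat", "skranke"],
    outlier="skranke",
)
print(record["text"])
print("Answer:", record["label"])
```

If the outlier always sits in the same position in the source lists, the candidates would probably need shuffling before labelling so that the answer letter is not constant.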

Apr 02 '25 14:04 saattrupdan

Great! Do you need anything more from us?

Apr 02 '25 14:04 Linguistcoder

Hi @Linguistcoder

I just had a look at your dataset, but I cannot open https://github.com/kuhumcst/danish-semantic-reasoning-benchmark/blob/main/similarity/similarity.zip as I don't have the required password. Would you be able to share the password or consider making this file publicly available?

Jun 20 '25 07:06 oliverkinch