[BENCHMARK DATASET REQUEST] Danish Similarity Outlier Detection

Open Linguistcoder opened this issue 9 months ago • 3 comments

Dataset name

Danish Similarity Outlier Detection

Dataset link

https://github.com/kuhumcst/danish-semantic-reasoning-benchmark/tree/main/similarity

Dataset languages

  • [x] Danish
  • [ ] Dutch
  • [ ] English
  • [ ] Faroese
  • [ ] French
  • [ ] German
  • [ ] Icelandic
  • [ ] Italian
  • [ ] Norwegian (Bokmål or Nynorsk)
  • [ ] Spanish
  • [ ] Swedish

Describe the dataset

The dataset measures the ability to find the outlier among a list of words (which word is the least similar to the rest). Some examples:

  • ['droge', 'bregnerod', 'medicinbrug', 'kinabark', 'lægeurt', 'salvie']
  • ['kontantautomat', 'pengeautomat', 'dankortautomat', 'hæveautomat', 'bankomat', 'skranke']
  • ['nationalitet', 'rige', 'nation', 'stat', 'land', 'enkeltstat']

The data is extracted from The Danish Thesaurus as part of the Danish Semantic Reasoning Benchmark.

The dataset tests the ability to distinguish word senses / meanings in Danish at different granularities across a broad selection of the vocabulary. Solving the task therefore requires a combination of Danish language skills and word knowledge.

There are three granularities (coarse, medium, fine), but I would start with medium or coarse.

Linguistcoder, Apr 02 '25 13:04

Looks good! We could formulate it as a multiple-choice task with 6 choices, in which case it fits in with the existing tasks. This would fit in the knowledge category I'd reckon.
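As a rough sketch of what that multiple-choice formulation could look like: the snippet below turns one word list into a six-option question. The function name `to_multiple_choice`, the prompt wording, and the `text`/`label` keys are all hypothetical, and the actual file format of the dataset is not known here. It also assumes, purely for illustration, that 'skranke' is the outlier in the second example from the issue.

```python
import random
import string


def to_multiple_choice(words, outlier, seed=0):
    """Turn one outlier-detection item into a multiple-choice example.

    Hypothetical helper: the prompt wording and output keys are assumptions,
    not an existing benchmark format.
    """
    if outlier not in words:
        raise ValueError("outlier must be one of the words")
    # Shuffle a copy so the outlier's position carries no signal.
    options = list(words)
    random.Random(seed).shuffle(options)
    # One letter per option: a-f for a six-word item.
    labels = string.ascii_lowercase[: len(options)]
    lines = ["Which word is least similar to the rest?"]
    lines += [f"{label}. {word}" for label, word in zip(labels, options)]
    return {"text": "\n".join(lines), "label": labels[options.index(outlier)]}


# Assumed outlier for illustration only; the issue does not mark it.
example = to_multiple_choice(
    ["kontantautomat", "pengeautomat", "dankortautomat",
     "hæveautomat", "bankomat", "skranke"],
    outlier="skranke",
)
print(example["text"])
print("answer:", example["label"])
```

Shuffling with a fixed seed keeps the option order reproducible while avoiding a fixed answer position, which matters here because the outlier happens to come last in two of the three examples listed in the issue.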

saattrupdan, Apr 02 '25 14:04

Great! Do you need anything more from us?

Linguistcoder, Apr 02 '25 14:04

Hi @Linguistcoder

I just had a look at your dataset, but I cannot open https://github.com/kuhumcst/danish-semantic-reasoning-benchmark/blob/main/similarity/similarity.zip, as I don't have the required password. Would you be able to share the password, or consider making this file publicly available?

oliverkinch, Jun 20 '25 07:06