scandinavian-embedding-benchmark
Extending the dataset to other languages
Extending the dataset to other Scandinavian languages
Before implementing any of these resources, check whether it is translated:
- Greenlandic
  - Danish-Greenlandic
  - Greenlandic news
- Icelandic
  - QA: https://huggingface.co/datasets/vesteinn/icelandic-qa-NQiI
  - ScaLA
  - News: https://huggingface.co/datasets/thors/RRN
  - https://huggingface.co/datasets/mideind/icelandic-error-corpus-IceEC
  - https://huggingface.co/datasets/vesteinn/icelandic-parallel-abstracts-corpus-IPAC
  - Potentially: https://huggingface.co/datasets/mideind/icelandic-english-translation and its reverse: https://huggingface.co/datasets/mideind/english-icelandic-translation
  - translation with localization: https://huggingface.co/datasets/mideind/icelandic-winogrande
  - Unsure what this is: https://huggingface.co/datasets/mideind/icelandic-sentences-gec
- Faroese
  - ScaLA
  - https://huggingface.co/datasets/strombergnlp/itu_faroese_danish
  - (translation): https://huggingface.co/datasets/vesteinn/faroese-sts
  - https://huggingface.co/datasets/vesteinn/faroese-parallel-bible
  - potentially some structure in: https://www.openslr.org/125/
Potentially Finnish as well:
- https://github.com/TurkuNLP/FIN-bench/tree/main/benchmark_tasks/emotions (the remainder of the dataset seems to be translated)
- Potentially also the Open Assistant datasets: https://huggingface.co/datasets/mkayhko/oasst2-finnish-threads
For anything beyond this, the benchmark should probably be renamed.
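
As a rough aid for the translation check mentioned above, the sketch below scans a Hugging Face dataset card for wording that hints at translated data. It is not part of the benchmark: the function names and the keyword list are hypothetical, and it assumes the `huggingface_hub` package is installed and that the dataset card actually describes where the data came from.

```python
import re

# Hypothetical keyword list; adjust it to the dataset cards you actually encounter.
TRANSLATION_HINTS = ("translated", "translation", "localized", "localised")


def repo_id_from_url(url: str) -> str:
    """Extract the 'owner/name' repo id from a Hugging Face dataset URL."""
    match = re.search(r"huggingface\.co/datasets/([^/\s]+/[^/\s]+)", url)
    if match is None:
        raise ValueError(f"Not a Hugging Face dataset URL: {url}")
    return match.group(1)


def find_translation_hints(card_text: str) -> list[str]:
    """Return the hint keywords found in the card text (case-insensitive)."""
    lowered = card_text.lower()
    return [hint for hint in TRANSLATION_HINTS if hint in lowered]


def check_dataset(url: str) -> list[str]:
    """Download the dataset card and scan it for translation hints.

    Needs network access; huggingface_hub is imported lazily so the
    helpers above remain usable offline.
    """
    from huggingface_hub import DatasetCard

    card = DatasetCard.load(repo_id_from_url(url))
    return find_translation_hints(card.text)


if __name__ == "__main__":
    url = "https://huggingface.co/datasets/vesteinn/faroese-sts"
    print(repo_id_from_url(url), check_dataset(url))
```

A match only flags a dataset for manual review. For example, "translation" also appears in the cards of parallel corpora, and an empty result can simply mean the card does not document the data's origin.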