Extending the dataset to other languages

Open KennethEnevoldsen opened this issue 1 year ago • 0 comments

Extending the dataset to other Scandinavian languages

Before implementing any of these resources, check whether each one is translated or originally written in the language:

  • Greenlandic
    • Danish-Greenlandic
    • Greenlandic news
  • Icelandic
    • QA: https://huggingface.co/datasets/vesteinn/icelandic-qa-NQiI
    • ScaLA
    • News: https://huggingface.co/datasets/thors/RRN
    • https://huggingface.co/datasets/mideind/icelandic-error-corpus-IceEC
    • https://huggingface.co/datasets/vesteinn/icelandic-parallel-abstracts-corpus-IPAC
    • Potentially: https://huggingface.co/datasets/mideind/icelandic-english-translation and its reverse: https://huggingface.co/datasets/mideind/english-icelandic-translation
    • translation with localization: https://huggingface.co/datasets/mideind/icelandic-winogrande
    • Unsure what this is: https://huggingface.co/datasets/mideind/icelandic-sentences-gec
  • Faroese
    • ScaLA
    • https://huggingface.co/datasets/strombergnlp/itu_faroese_danish
    • (translation): https://huggingface.co/datasets/vesteinn/faroese-sts
    • https://huggingface.co/datasets/vesteinn/faroese-parallel-bible
    • potentially some structure in: https://www.openslr.org/125/
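One way to keep track of the translated-or-not check is a small checklist structure. The following is only a sketch: the `Candidate` class, the `needs_check` and `by_language` helpers, and the choice of entries are hypothetical and not part of the benchmark code. Only the dataset identifiers and the two "translation" flags are taken from the list above; everything else is left unchecked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A candidate resource (hypothetical helper, not part of the benchmark)."""
    language: str
    source: str
    translated: Optional[bool] = None  # None means "not yet checked"
    note: str = ""


# A subset of the resources listed above, for illustration only.
CANDIDATES = [
    Candidate("Icelandic", "vesteinn/icelandic-qa-NQiI", note="QA"),
    Candidate("Icelandic", "thors/RRN", note="News"),
    Candidate("Icelandic", "mideind/icelandic-winogrande", translated=True,
              note="translation with localization"),
    Candidate("Faroese", "strombergnlp/itu_faroese_danish"),
    Candidate("Faroese", "vesteinn/faroese-sts", translated=True,
              note="listed as a translation"),
]


def needs_check(candidates):
    """Return the candidates whose translation status is still unknown."""
    return [c for c in candidates if c.translated is None]


def by_language(candidates):
    """Group candidate sources by language, preserving list order."""
    grouped = {}
    for c in candidates:
        grouped.setdefault(c.language, []).append(c.source)
    return grouped


if __name__ == "__main__":
    for language, sources in by_language(needs_check(CANDIDATES)).items():
        print(f"{language}: {', '.join(sources)}")
```

Setting `translated` to `False` once a resource is confirmed as original text would leave only the unverified entries in the output.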

Potentially Finnish as well:

  • https://github.com/TurkuNLP/FIN-bench/tree/main/benchmark_tasks/emotions (the remainder of the dataset seems to be translated)
  • Potentially also the OpenAssistant datasets: https://huggingface.co/datasets/mkayhko/oasst2-finnish-threads

If the benchmark is extended beyond this, it should probably be renamed.

KennethEnevoldsen, Mar 03 '24 16:03