scandinavian-embedding-benchmark

A Scandinavian Benchmark for sentence embeddings

Results 26 scandinavian-embedding-benchmark issues

https://aclanthology.org/2023.nodalida-1.61/

dataset

Add a naive baseline for a 7B model, e.g. one of the best-performing models on ScandEval. Potentially take a look at: https://github.com/vllm-project/vllm/issues/1654

model

One way to do this is to create a Gradio app and embed it. This would allow for much more user customization in the averaging.

documentation

ScandiSent seems to be a valid cross-lingual dataset for the Scandinavian languages: https://github.com/timpal0l/ScandiSent?tab=readme-ov-file

dataset

Some models, such as the "translate and embed" models, can't be used for cross-lingual tasks; ideally their scores should just be NaN. I am unsure what the best solution is....
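
To make the trade-off concrete, here is a minimal sketch of how NaN scores interact with averaging. It is not the benchmark's actual aggregation code: the function name `mean_score`, the `skip_nan` flag, and the task names are all hypothetical and chosen only for illustration.

```python
import math


def mean_score(scores: dict[str, float], skip_nan: bool) -> float:
    """Average per-task scores (hypothetical helper, not part of the benchmark).

    With skip_nan=True, tasks marked NaN are ignored in the average.
    With skip_nan=False, a single NaN makes the whole average NaN.
    """
    values = list(scores.values())
    if skip_nan:
        values = [v for v in values if not math.isnan(v)]
    if not values:
        return math.nan
    return sum(values) / len(values)


# Hypothetical task names; the cross-lingual task is unsupported for this model.
scores = {"task_a": 0.6, "task_b": 0.8, "cross_lingual_task": math.nan}

print(mean_score(scores, skip_nan=True))   # average over supported tasks only
print(mean_score(scores, skip_nan=False))  # NaN propagates to the average
```

Skipping NaN keeps the model on the leaderboard but averages over fewer tasks than other models, whereas propagating NaN makes the gap explicit at the cost of having no overall score.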

Might be interesting to add author-style clustering based on: https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1

dataset

Add sentence-transformers/use-cmlm-multilingual, as it performs well on ScandEval.

model

Add a time x performance plot to the website. This allows us to see how performance has developed over time. This requires us to add a date to each of...

documentation

The current implementation of the evaluators only gives a single score, which makes it hard to see the uncertainty in the scores. A potential solution is bootstrapping on the document...

enhancement
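
As a rough illustration of the bootstrapping idea, the sketch below resamples documents with replacement and reports a percentile interval around the score. It assumes labels and predictions are already available as plain lists. The names `bootstrap_ci` and `accuracy` are hypothetical and not part of the existing evaluators, and the percentile method is only one of several possible bootstrap variants.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of documents where the prediction matches the label."""
    return mean(int(t == p) for t, p in zip(y_true, y_pred))


def bootstrap_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, lower, upper) from a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    resampled_scores = []
    for _ in range(n_resamples):
        # Draw n document indices with replacement and re-score on that sample.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_scores.append(
            metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )
    resampled_scores.sort()
    lower = resampled_scores[int((alpha / 2) * n_resamples)]
    upper = resampled_scores[
        min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    ]
    return metric(y_true, y_pred), lower, upper


# Toy data for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
score, lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy={score:.2f}, 95% interval=[{lo:.2f}, {hi:.2f}]")
```

Because resampling happens after predictions are computed, this adds no extra embedding cost. It captures uncertainty from the test set only, not from training randomness in the downstream classifier.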