scandinavian-embedding-benchmark issues

Create a Gradio app, hosted on Hugging Face

KennethEnevoldsen

documentation

Check whether there are datasets to add from here:

2

https://aclanthology.org/2023.nodalida-1.61/

KennethEnevoldsen

dataset

Add a naive 7B baseline model

Add a naive 7B baseline model, e.g. one of the best-performing models on ScandEval. Potentially take a look at: https://github.com/vllm-project/vllm/issues/1654

KennethEnevoldsen

model

Split Norwegian into Nynorsk and Bokmål

One way to do this is to create a Gradio app and embed it. This would allow for much more user customization in the averaging.

KennethEnevoldsen

documentation

Add ScandiSent

ScandiSent seems to be a valid cross-lingual dataset for the Scandinavian languages: https://github.com/timpal0l/ScandiSent?tab=readme-ov-file

KennethEnevoldsen

dataset

Find a solution allowing empty results

1

Some models, such as the "translate and embed" models, can't be used for cross-lingual tasks; ideally their scores should just be NaN. I am unsure what the best solution is....

KennethEnevoldsen
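To make the trade-off in the empty-results issue concrete, here is a minimal sketch of two ways a NaN score could be handled when averaging across tasks. It is not taken from the benchmark code: `mean_score`, the `skip_nan` flag, and the task names are hypothetical, and the scores are placeholders.

```python
import math


def mean_score(scores: dict[str, float], skip_nan: bool) -> float:
    """Average task scores, either ignoring NaN entries or letting them propagate."""
    values = list(scores.values())
    if skip_nan:
        # Average only over the tasks the model could actually run.
        values = [v for v in values if not math.isnan(v)]
    if not values:
        return math.nan
    # Without filtering, a single NaN makes the whole sum (and mean) NaN.
    return sum(values) / len(values)


# Hypothetical task names and placeholder scores, for illustration only.
scores = {"monolingual_task": 0.5, "cross_lingual_task": math.nan}
print(mean_score(scores, skip_nan=True))   # average over available tasks
print(mean_score(scores, skip_nan=False))  # NaN propagates to the average
```

Skipping NaN keeps the model on the leaderboard but averages over fewer tasks, so its mean is not directly comparable to models that ran everything. Propagating NaN is more honest but removes the model from any ranking based on the average.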

Author style clustering?

1

Might be interesting to add author-style clustering based on: https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1

KennethEnevoldsen

dataset

Add sentence-transformers/use-cmlm-multilingual

Add sentence-transformers/use-cmlm-multilingual, as it performs well on ScandEval.

KennethEnevoldsen

model

Add a time x performance plot to the website

2

Add a time x performance plot to the website. This allows us to see how performance has developed over time. This requires us to add a date to each of...

KennethEnevoldsen

documentation
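For the time x performance plot issue above, one possible way to prepare the data is a running best score over time. This is only a sketch under assumptions: it assumes each entry ends up with a date and a single aggregate score, and `running_best` and the example values are hypothetical placeholders, not benchmark results.

```python
from datetime import date


def running_best(entries: list[tuple[date, float]]) -> list[tuple[date, float]]:
    """Return (date, best score seen so far) pairs in chronological order."""
    best = float("-inf")
    frontier = []
    for released, score in sorted(entries):
        best = max(best, score)
        frontier.append((released, best))
    return frontier


# Placeholder entries for illustration only.
entries = [
    (date(2021, 6, 1), 0.40),
    (date(2020, 1, 1), 0.30),
    (date(2022, 3, 1), 0.35),
]
for released, best in running_best(entries):
    print(released.isoformat(), best)
```

The resulting pairs can be drawn as a step line on top of a scatter of the individual entries, which shows both the overall trend and the entries that did not improve on the best score.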

Bootstrapping scores for an uncertainty estimate

1

The current implementation of the evaluators only gives a single score. This makes it hard to see the uncertainty in the scores. A potential solution is bootstrapping on the document...

KennethEnevoldsen

enhancement
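As one possible reading of the bootstrapping issue above, the sketch below resamples documents with replacement and recomputes a metric on each resample, giving a distribution of scores rather than a single number. It assumes per-document predictions are already available. `bootstrap_scores`, `percentile_interval`, and `accuracy` are hypothetical helpers, not part of the existing evaluators, and the labels are placeholders.

```python
import random
from typing import Callable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of documents where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_scores(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> list[float]:
    """Resample documents with replacement and recompute the metric each time."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return scores


def percentile_interval(scores: Sequence[float], alpha: float = 0.05) -> tuple[float, float]:
    """Simple percentile interval taken from the sorted bootstrap scores."""
    ordered = sorted(scores)
    lower = ordered[int((alpha / 2) * (len(ordered) - 1))]
    upper = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
    return lower, upper


# Placeholder labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
scores = bootstrap_scores(y_true, y_pred, accuracy, n_resamples=200, seed=1)
print(percentile_interval(scores))
```

Resampling predictions is cheap because the model only runs once. It captures uncertainty from the test documents only, not from training randomness in evaluators that fit a classifier on top of the embeddings.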
