ScandEval [BENCHMARK DATASET REQUEST] Schibsted Summaries

Dataset name

Schibsted Summaries

Dataset link

https://huggingface.co/datasets/Schibsted/schibsted-article-summaries

Dataset languages

[ ] Danish
[X] Swedish
[X] Norwegian (Bokmål or Nynorsk)
[ ] Icelandic
[ ] Faroese
[ ] German
[ ] Dutch
[ ] English

Describe the dataset

This is a JSON Lines dataset of articles and (as far as I can tell) human-written summaries from the Schibsted corporation, in Norwegian and Swedish.

$ grep -c "summary" *
README.md:0
summary-data-test.jsonl:517
summary-data-train.jsonl:2000
summary-data-validation.jsonl:491

Sep 26 '24 08:09 larsbun
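
For anyone who prefers to check the split sizes without grep, the count can be sketched in Python. This is a minimal sketch, assuming each line of the `.jsonl` files is one JSON object with a `summary` key (as the field list later in this thread suggests); the function names are made up for illustration. Unlike `grep -c`, it counts parsed records with a non-empty summary rather than lines containing the string.

```python
import json
from pathlib import Path


def count_summaries(path):
    """Count JSON Lines records in `path` that have a non-empty "summary" field."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if record.get("summary"):
                count += 1
    return count


def count_all(directory="."):
    """Return {file name: count} for every *.jsonl file in `directory`."""
    paths = sorted(Path(directory).glob("*.jsonl"))
    return {p.name: count_summaries(p) for p in paths}
```

Calling `count_all()` in the directory holding the three split files should give one count per file.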

@larsbun Is it possible to separate the Norwegian/Swedish samples in the dataset, or would we have to use a language identification model?

Sep 26 '24 09:09 saattrupdan

I didn't know about the dataset until someone told me about it at dinner on Tuesday (i.e., I haven't worked with it), but a crude regexp search for the field names gave me this:

article_id newsroom article_title article_text_all summary num_bulletpoints num_words_per_bulletpoint num_words_per_bulletpoint_bucket num_bulletpoints_bucket

Seemingly, there is no language ID field. But I guess a language ID model should separate them 100%, since the alphabets differ (Norwegian uses æ/ø where Swedish uses ä/ö). I just looked through the file and saw some Swedish text.

Sep 26 '24 12:09 larsbun
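
The alphabet argument above can be illustrated with a crude character count. This is only a sketch of the idea, not a real language ID model: it assumes that æ/ø only show up in Norwegian text and ä/ö only in Swedish text, which names and loanwords can violate, and it returns nothing useful for short texts without any of those letters. The function name and the `no`/`sv` labels are choices made for this example.

```python
def guess_language(text):
    """Crude guess: 'no', 'sv', or None if no distinguishing letters are found.

    The letter å is shared by both alphabets, so it is ignored.
    """
    lowered = text.lower()
    norwegian_hits = sum(lowered.count(ch) for ch in "æø")
    swedish_hits = sum(lowered.count(ch) for ch in "äö")
    if norwegian_hits > swedish_hits:
        return "no"
    if swedish_hits > norwegian_hits:
        return "sv"
    return None  # tie or no evidence either way
```

Applied to a whole `article_text_all` value, the counts are usually large enough to be decisive; on a single headline the function may well return `None`.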

Hey, I was just notified of this. Didn't include any language identifier, but it's pretty easy to derive one from the newsroom.

Nov 01 '24 11:11 simeneide

Do you know which newsrooms correspond to which languages? I can then make a PR that uses this hardcoded mapping instead of depending on a language classifier.

Here are the 13 newsroom abbreviations:

{'sno-commercial', 'vektklubb', 'e24', 'e24partnerstudio', 'dinepenger', 'vgpartnerstudio', 'bt', 'tekno', 'vg', 'ap', 'randaberg24', 'ab', 'sa'}

Nov 01 '24 12:11 oliverkinch

I should be, as I work there :D

{
'sno-commercial' : 'no', 
'vektklubb' : 'no',
'e24' : 'no', 
'e24partnerstudio' : 'no', 
'dinepenger' : 'no', 
'vgpartnerstudio' : 'no', 
'bt' : 'no', 
'tekno' : 'no', 
'vg' : 'no', 
'ap' : 'no', 
'randaberg24' : 'no', 
'ab' : 'se', 
'sa' : 'no'}

Nov 01 '24 12:11 simeneide
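
A hardcoded split along the lines discussed above could look roughly like this. It is a minimal sketch, assuming each record is a dict with a `newsroom` key; `split_by_language` is a hypothetical helper, not something from ScandEval. The mapping is copied from the thread as posted, including the `se` label: note that `se` is the country code for Sweden, while the ISO 639-1 language code for Swedish is `sv`, so a benchmark keyed on language codes may want to rename it.

```python
# Newsroom -> label mapping, copied verbatim from the thread.
NEWSROOM_TO_LANGUAGE = {
    "sno-commercial": "no",
    "vektklubb": "no",
    "e24": "no",
    "e24partnerstudio": "no",
    "dinepenger": "no",
    "vgpartnerstudio": "no",
    "bt": "no",
    "tekno": "no",
    "vg": "no",
    "ap": "no",
    "randaberg24": "no",
    "ab": "se",
    "sa": "no",
}


def split_by_language(records):
    """Group records by the label of their newsroom.

    Raises KeyError for a newsroom missing from the mapping, so that
    new newsrooms are noticed instead of being silently dropped.
    """
    groups = {}
    for record in records:
        label = NEWSROOM_TO_LANGUAGE[record["newsroom"]]
        groups.setdefault(label, []).append(record)
    return groups
```

Failing loudly on an unknown newsroom is a deliberate choice here: if the dataset is later extended, a silent default would put articles in the wrong language split.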