[BENCHMARK DATASET REQUEST] Schibsted Summaries

Open larsbun opened this issue 1 year ago • 2 comments

Dataset name

Schibsted Summaries

Dataset link

https://huggingface.co/datasets/Schibsted/schibsted-article-summaries

Dataset languages

  • [ ] Danish
  • [X] Swedish
  • [X] Norwegian (Bokmål or Nynorsk)
  • [ ] Icelandic
  • [ ] Faroese
  • [ ] German
  • [ ] Dutch
  • [ ] English

Describe the dataset

This is a json-formatted dataset of articles and human-created (as I gather) summaries from the Schibsted corporation in Norwegian and Swedish.

$ grep -c "summary" *
README.md:0
summary-data-test.jsonl:517
summary-data-train.jsonl:2000
summary-data-validation.jsonl:491
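For anyone who would rather check the counts from Python, here is a minimal sketch. It assumes each line of the `.jsonl` files is one JSON object with `summary` as a top-level key; the helper name `count_records_with_field` and the sample records are made up for illustration. Unlike `grep -c`, which counts any line containing the string, this counts only records that actually have the key.

```python
import json
from typing import Iterable


def count_records_with_field(lines: Iterable[str], field: str = "summary") -> int:
    """Count JSONL records that contain `field` as a top-level key."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if field in record:
            count += 1
    return count


# Tiny in-memory stand-in for a JSONL file (made-up records, not from the dataset).
sample_lines = [
    '{"article_id": "a1", "summary": "First summary."}',
    "",
    '{"article_id": "a2", "summary": "Second summary."}',
    '{"article_id": "a3"}',
]
sample_count = count_records_with_field(sample_lines)
print(sample_count)
```

With the real files, an open file handle can be passed directly, for example `count_records_with_field(open("summary-data-train.jsonl", encoding="utf-8"))`.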

larsbun, Sep 26 '24 08:09

@larsbun Is it possible to separate the Norwegian/Swedish samples in the dataset, or would we have to use a language identification model?

saattrupdan, Sep 26 '24 09:09

> @larsbun Is it possible to separate the Norwegian/Swedish samples in the dataset, or would we have to use a language identification model?

I didn't know about the dataset until someone told me at dinner on Tuesday (i.e., I haven't worked on it), but a crude regexp search for the field names gave me this:

article_id newsroom article_title article_text_all summary num_bulletpoints num_words_per_bulletpoint num_words_per_bulletpoint_bucket num_bulletpoints_bucket

Seemingly, there is no language ID there. But I guess a language ID model should separate them with close to 100% accuracy, since the alphabets are different (Swedish uses ä and ö where Norwegian uses æ and ø). I just looked through the file and saw some Swedish text.
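To illustrate the alphabet argument, here is a rough sketch of a character-based heuristic. It is not a real language identification model: it only counts letters that are typical of one language (ä/ö for Swedish, æ/ø for Norwegian) and will misfire on short texts, names, or quotes in the other language. The function name `guess_language` and the labels `"sv"`, `"no"`, and `"unknown"` are choices made for this example.

```python
from collections import Counter

# Letters typical of one language but not the other; å is shared, so it is ignored.
SWEDISH_ONLY = set("äöÄÖ")
NORWEGIAN_ONLY = set("æøÆØ")


def guess_language(text: str) -> str:
    """Return 'sv', 'no', or 'unknown' based on distinctive letter counts."""
    counts = Counter(text)
    swedish_hits = sum(counts[char] for char in SWEDISH_ONLY)
    norwegian_hits = sum(counts[char] for char in NORWEGIAN_ONLY)
    if swedish_hits > norwegian_hits:
        return "sv"
    if norwegian_hits > swedish_hits:
        return "no"
    return "unknown"  # tie, or no distinctive letters at all


print(guess_language("Det är en bil som kör för fort"))
print(guess_language("Det er en bil som kjører for fort"))
```

For full articles this is likely to separate most samples, but any text that falls into `"unknown"` would still need a proper classifier or another signal.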

larsbun, Sep 26 '24 12:09

Hey, just got notified of this now. No language identifier was provided, but it's pretty easy to derive one from the newsroom field.

simeneide, Nov 01 '24 11:11

> Hey, just got notified of this now. No language identifier was provided, but it's pretty easy to derive one from the newsroom field.

Do you know which newsrooms correspond to which languages? I can then make a PR that uses this hardcoded mapping instead of depending on a language classifier.

Here are the 13 newsroom abbreviations:

{'sno-commercial', 'vektklubb', 'e24', 'e24partnerstudio', 'dinepenger', 'vgpartnerstudio', 'bt', 'tekno', 'vg', 'ap', 'randaberg24', 'ab', 'sa'}

oliverkinch, Nov 01 '24 12:11

I should be, as I work there :D

{
    'sno-commercial': 'no',
    'vektklubb': 'no',
    'e24': 'no',
    'e24partnerstudio': 'no',
    'dinepenger': 'no',
    'vgpartnerstudio': 'no',
    'bt': 'no',
    'tekno': 'no',
    'vg': 'no',
    'ap': 'no',
    'randaberg24': 'no',
    'ab': 'se',
    'sa': 'no',
}
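A hardcoded mapping like this could be applied roughly as follows. This is only a sketch: it assumes each record is a dict with a `newsroom` key (as in the field list earlier in the thread), and the helper `split_by_language` and the sample records are made up for illustration. The labels are kept as written above; note that `'se'` is the country code for Sweden, whereas the ISO 639-1 language code for Swedish is `'sv'`, so a pipeline expecting language codes may need to rename it.

```python
from collections import defaultdict

# Mapping copied from the comment above; labels left unchanged.
NEWSROOM_TO_LANGUAGE = {
    "sno-commercial": "no", "vektklubb": "no", "e24": "no",
    "e24partnerstudio": "no", "dinepenger": "no", "vgpartnerstudio": "no",
    "bt": "no", "tekno": "no", "vg": "no", "ap": "no",
    "randaberg24": "no", "ab": "se", "sa": "no",
}


def split_by_language(records):
    """Group records by language label using their 'newsroom' field."""
    by_language = defaultdict(list)
    for record in records:
        newsroom = record["newsroom"]
        if newsroom not in NEWSROOM_TO_LANGUAGE:
            # Fail loudly rather than silently guessing a language.
            raise KeyError(f"Unknown newsroom: {newsroom!r}")
        by_language[NEWSROOM_TO_LANGUAGE[newsroom]].append(record)
    return dict(by_language)


# Made-up records for illustration only (not taken from the dataset).
sample_records = [
    {"article_id": "a1", "newsroom": "vg"},
    {"article_id": "a2", "newsroom": "ab"},
    {"article_id": "a3", "newsroom": "e24"},
]
groups = split_by_language(sample_records)
print({language: len(items) for language, items in groups.items()})
```

Raising on an unknown newsroom is a deliberate choice here: if a later version of the dataset adds a newsroom, the split fails visibly instead of dropping or mislabeling samples.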

simeneide, Nov 01 '24 12:11