[MODEL EVALUATION REQUEST] google/gemma-2-27b

Open Mikeriess opened this issue 1 year ago • 2 comments

Model ID

google/gemma-2-27b

Model type

Decoder model (e.g., GPT)

Model languages

  • [x] Danish
  • [x] Swedish
  • [x] Norwegian (Bokmål or Nynorsk)
  • [x] Icelandic
  • [x] Faroese
  • [x] German
  • [x] Dutch
  • [x] English

Merged model

Not a merged model

Mikeriess • Sep 25 '24 07:09

The Norwegian results are below. I promise to run the rest of the languages later this week; I just needed the Norwegian ones myself right now :-)

{"dataset": "norec", "task": "sentiment-classification", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"mcc": 0.6554253529627034, "macro_f1": 0.7760934973361779}, {"mcc": 0.6385532148777571, "macro_f1": 0.7575816622938761}, {"mcc": 0.6817564621890219, "macro_f1": 0.7912456669939866}, {"mcc": 0.652680596372727, "macro_f1": 0.7672986421753247}, {"mcc": 0.6019775344436946, "macro_f1": 0.7381669300885362}, {"mcc": 0.6420528885923462, "macro_f1": 0.7567231201807183}, {"mcc": 0.6305808470850875, "macro_f1": 0.7520194965623133}, {"mcc": 0.6686464754931485, "macro_f1": 0.7880332686021218}, {"mcc": 0.5743934698730302, "macro_f1": 0.701980630886999}, {"mcc": 0.6692597116597151, "macro_f1": 0.7851090034319618}]}, "total": {"test_mcc": 64.15326553549232, "test_mcc_se": 2.0285505307888254, "test_macro_f1": 76.14251918552016, "test_macro_f1_se": 1.6794329371673604}}, "num_model_parameters": 27227128320, "max_sequence_length": 8193, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "norne-nb", "task": "named-entity-recognition", "dataset_languages": ["nb", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"micro_f1_no_misc": 0.3901147396293027, "micro_f1": 0.40273520396132984}, {"micro_f1_no_misc": 0.3754142814100633, "micro_f1": 0.4086444007858546}, {"micro_f1_no_misc": 0.41122504047490555, "micro_f1": 0.4016118200134318}, {"micro_f1_no_misc": 0.4425666754687087, "micro_f1": 0.4501460158531498}, {"micro_f1_no_misc": 0.4478744939271255, "micro_f1": 0.43752647183396864}, {"micro_f1_no_misc": 0.4576354679802956, "micro_f1": 0.46631689401888776}, {"micro_f1_no_misc": 0.43914610479622956, "micro_f1": 0.4919717405266539}, {"micro_f1_no_misc": 0.4580737901293723, "micro_f1": 0.42406542056074764}, {"micro_f1_no_misc": 0.40743871513102287, "micro_f1": 0.41283422459893054}, {"micro_f1_no_misc": 0.428163159262146, "micro_f1": 0.41038756639779655}]}, "total": {"test_micro_f1_no_misc": 42.57652468209172, "test_micro_f1_no_misc_se": 1.7735220250835557, "test_micro_f1": 43.062397585507505, "test_micro_f1_se": 1.886525290702595}}, "num_model_parameters": 27227128320, "max_sequence_length": 8320, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "norne-nn", "task": "named-entity-recognition", "dataset_languages": ["nn"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"micro_f1_no_misc": 0.4295868502887605, "micro_f1": 0.42539682539682544}, {"micro_f1_no_misc": 0.4163805787150564, "micro_f1": 0.40653728294177727}, {"micro_f1_no_misc": 0.41632088520055327, "micro_f1": 0.40033535946342486}, {"micro_f1_no_misc": 0.3812641651976832, "micro_f1": 0.3673734314149719}, {"micro_f1_no_misc": 0.4028268551236749, "micro_f1": 0.3962968080696421}, {"micro_f1_no_misc": 0.40215792054928884, "micro_f1": 0.44215438460042783}, {"micro_f1_no_misc": 0.4066666666666666, "micro_f1": 0.42644095122934306}, {"micro_f1_no_misc": 0.408695652173913, "micro_f1": 0.40670157068062834}, {"micro_f1_no_misc": 0.39099927413501084, "micro_f1": 0.4004291845493563}, {"micro_f1_no_misc": 0.38817285822592873, "micro_f1": 0.37035726918995393}]}, "total": {"test_micro_f1_no_misc": 40.43071706276537, "test_micro_f1_no_misc_se": 0.907449560262362, "test_micro_f1": 40.42023067536351, "test_micro_f1_se": 1.459342938180053}}, "num_model_parameters": 27227128320, "max_sequence_length": 8320, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "scala-nb", "task": "linguistic-acceptability", "dataset_languages": ["nb", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"mcc": 0.48272014576736705, "macro_f1": 0.682304777939359}, {"mcc": 0.6264106575722953, "macro_f1": 0.8124597582298295}, {"mcc": 0.4043483749902117, "macro_f1": 0.6115020445649559}, {"mcc": 0.36804534719519383, "macro_f1": 0.6102555802537137}, {"mcc": 0.5631786103496347, "macro_f1": 0.7737857944721387}, {"mcc": 0.631690180085171, "macro_f1": 0.8078207428217306}, {"mcc": 0.5415521716452729, "macro_f1": 0.765373483734701}, {"mcc": 0.6130206736356687, "macro_f1": 0.8037770002907811}, {"mcc": 0.4940761167753505, "macro_f1": 0.7257714835532694}, {"mcc": 0.43386563277602064, "macro_f1": 0.6637724615818281}]}, "total": {"test_mcc": 51.58907910792186, "test_mcc_se": 5.863908957604346, "test_macro_f1": 72.56823127442307, "test_macro_f1_se": 4.903921624872546}}, "num_model_parameters": 27227128320, "max_sequence_length": 8193, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "scala-nn", "task": "linguistic-acceptability", "dataset_languages": ["nn"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"mcc": 0.42547310607172706, "macro_f1": 0.6904605425159931}, {"mcc": 0.5283547826384309, "macro_f1": 0.7640021042611288}, {"mcc": 0.4674254571482998, "macro_f1": 0.7237452812674052}, {"mcc": 0.4469217578506206, "macro_f1": 0.7203574975173783}, {"mcc": 0.3593184230016558, "macro_f1": 0.6327520597379817}, {"mcc": 0.36620826010578467, "macro_f1": 0.6459713822461424}, {"mcc": 0.39060963307471, "macro_f1": 0.6763688257620466}, {"mcc": 0.3920286051760315, "macro_f1": 0.6611136207404498}, {"mcc": 0.4941940914020561, "macro_f1": 0.7440857117511215}, {"mcc": 0.37028900367391376, "macro_f1": 0.6174735721096494}]}, "total": {"test_mcc": 42.4082312014323, "test_mcc_se": 3.6244375579295283, "test_macro_f1": 68.76330597909298, "test_macro_f1_se": 3.0575892339296686}}, "num_model_parameters": 27227128320, "max_sequence_length": 8193, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "norquad", "task": "reading-comprehension", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"em": 65.83953680727875, "f1": 84.39980416515665}, {"em": 38.76357560568087, "f1": 64.12544098228207}, {"em": 65.41353383458646, "f1": 83.5091236906422}, {"em": 58.14722911497105, "f1": 82.71218138084198}, {"em": 65.48890714872638, "f1": 83.87416009636368}, {"em": 69.96699669966996, "f1": 86.68344639892516}, {"em": 63.314097279472385, "f1": 84.03083850623298}, {"em": 59.700249791840136, "f1": 81.56755050090068}, {"em": 41.854636591478695, "f1": 69.95280657772234}, {"em": 57.03517587939699, "f1": 81.19658910819783}]}, "total": {"test_em": 58.552393875310166, "test_em_se": 6.452658155124408, "test_f1": 80.20519414072655, "test_f1_se": 4.4855465763129745}}, "num_model_parameters": 27227128320, "max_sequence_length": 8224, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "no-sammendrag", "task": "summarization", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"bertscore": 0.6784601474355441, "rouge_l": 0.23940170625620996}, {"bertscore": 0.6668498278304469, "rouge_l": 0.21325681024263077}, {"bertscore": 0.6768509268003982, "rouge_l": 0.23440105524710653}, {"bertscore": 0.6295910554763395, "rouge_l": 0.17297840470383186}, {"bertscore": 0.6811683000705671, "rouge_l": 0.24444847982677503}, {"bertscore": 0.6755348383885575, "rouge_l": 0.227886466847326}, {"bertscore": 0.675693364450126, "rouge_l": 0.23514369419004405}, {"bertscore": 0.6809803352807648, "rouge_l": 0.2402844796428253}, {"bertscore": 0.6713758167461492, "rouge_l": 0.22359865824535055}, {"bertscore": 0.6478132130141603, "rouge_l": 0.18698106677310783}]}, "total": {"test_bertscore": 66.84317825493054, "test_bertscore_se": 1.0410813935974532, "test_rouge_l": 22.183808219752077, "test_rouge_l_se": 1.491325258638892}}, "num_model_parameters": 27227128320, "max_sequence_length": 8448, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "mmlu-no", "task": "knowledge", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"mcc": 0.5462770400069564, "accuracy": 0.6552734375}, {"mcc": 0.5912161873204425, "accuracy": 0.69140625}, {"mcc": 0.5725212997396886, "accuracy": 0.677734375}, {"mcc": 0.5793193748774299, "accuracy": 0.68359375}, {"mcc": 0.5573848254207968, "accuracy": 0.66455078125}, {"mcc": 0.5917828567359295, "accuracy": 0.69189453125}, {"mcc": 0.5685289454046886, "accuracy": 0.67529296875}, {"mcc": 0.5719021740992959, "accuracy": 0.67724609375}, {"mcc": 0.5796405752445077, "accuracy": 0.681640625}, {"mcc": 0.5703141641121774, "accuracy": 0.67724609375}]}, "total": {"test_mcc": 57.28887442961913, "test_mcc_se": 0.8655426070216309, "test_accuracy": 67.7587890625, "test_accuracy_se": 0.6918996916280645}}, "num_model_parameters": 27227128320, "max_sequence_length": 8193, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "hellaswag-no", "task": "common-sense-reasoning", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"mcc": 0.40330057519388113, "accuracy": 0.5009765625}, {"mcc": 0.5222201003775869, "accuracy": 0.6279296875}, {"mcc": 0.6138611733120316, "accuracy": 0.703125}, {"mcc": 0.6735138033998871, "accuracy": 0.75244140625}, {"mcc": 0.4786205167713014, "accuracy": 0.57861328125}, {"mcc": 0.4397150945444776, "accuracy": 0.5439453125}, {"mcc": 0.5643671320003107, "accuracy": 0.65869140625}, {"mcc": 0.5335447577992785, "accuracy": 0.64453125}, {"mcc": 0.6126614868888981, "accuracy": 0.701171875}, {"mcc": 0.5315978269892824, "accuracy": 0.64404296875}]}, "total": {"test_mcc": 53.73402467276935, "test_mcc_se": 5.14533669438358, "test_accuracy": 63.55468749999999, "test_accuracy_se": 4.757496796677776}}, "num_model_parameters": 27227128320, "max_sequence_length": 8193, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}
{"dataset": "speed", "task": "speed", "dataset_languages": ["ab", "aa", "af", "sq", "am", "ar", "an", "hy", "as", "av", "ae", "ay", "az", "bm", "ba", "eu", "be", "bn", "bi", "bs", "br", "bg", "my", "ca", "ch", "ce", "ny", "zh", "cu", "cv", "kw", "co", "cr", "hr", "cs", "da", "dv", "nl", "dz", "en", "eo", "et", "ee", "fo", "fj", "fi", "fr", "fy", "ff", "gd", "gl", "lg", "ka", "de", "el", "kl", "gn", "gu", "ht", "ha", "he", "hz", "hi", "ho", "hu", "is", "io", "ig", "id", "ia", "ie", "iu", "ik", "ga", "it", "ja", "kn", "kr", "ks", "kk", "km", "ki", "rw", "ky", "kv", "kg", "ko", "kj", "ku", "lo", "la", "lv", "li", "ln", "lt", "lu", "lb", "mk", "mg", "ms", "ml", "mt", "gv", "mi", "mr", "mh", "mn", "na", "nv", "nd", "nr", "ng", "ne", "no", "nb", "nn", "ii", "oc", "oj", "or", "om", "os", "pi", "ps", "fa", "pl", "pt", "pa", "qu", "ro", "rm", "rn", "ru", "se", "sm", "sg", "sa", "sc", "sr", "sn", "sd", "si", "sk", "sl", "so", "st", "es", "su", "sw", "ss", "sv", "tl", "ty", "tg", "ta", "tt", "te", "th", "bo", "ti", "to", "ts", "tn", "tr", "tk", "tw", "ug", "uk", "ur", "uz", "ve", "vi", "vo", "wa", "cy", "wo", "xh", "yi", "yo", "za", "zu"], "model": "google/gemma-2-27b", "results": {"raw": {"test": [{"test_speed": 667.92, "test_speed_short": 103.07}, {"test_speed": 1070.16, "test_speed_short": 182.39999999999998}, {"test_speed": 1210.4, "test_speed_short": 321.48}, {"test_speed": 1462.48, "test_speed_short": 390.1}, {"test_speed": 1631.72, "test_speed_short": 456.40000000000003}, {"test_speed": 1696.46, "test_speed_short": 557.22}, {"test_speed": 1858.08, "test_speed_short": 628.3100000000001}, {"test_speed": 1913.3, "test_speed_short": 693.68}, {"test_speed": 1859.48, "test_speed_short": 746.39}, {"test_speed": 1939.3, "test_speed_short": 811.8}]}, "total": {"test_speed": 1530.9299999999998, "test_speed_se": 263.77390050387044, "test_speed_short": 489.0849999999999, "test_speed_short_se": 148.73301896826482}}, "num_model_parameters": 27227128320, "max_sequence_length": 8193, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "13.0.0"}

Mikeriess • Sep 26 '24 05:09
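
For anyone checking the `total` fields against the `raw` iterations, here is a minimal Python sketch. It assumes (inferred from the numbers in this thread, not taken from the ScandEval documentation) that each `*_se` value is 1.96 times the standard error of the mean over the iterations, computed with the sample standard deviation, and that metrics on a 0-1 scale are multiplied by 100. The `summarise` helper and its `scale` parameter are hypothetical names; the accuracies are the `mmlu-no` values from the record above.

```python
import math
import statistics

# Per-iteration accuracies copied from the "mmlu-no" record above.
mmlu_no_accuracy = [
    0.6552734375, 0.69140625, 0.677734375, 0.68359375, 0.66455078125,
    0.69189453125, 0.67529296875, 0.67724609375, 0.681640625, 0.67724609375,
]

def summarise(scores, scale=100.0):
    """Return (mean, interval half-width), both multiplied by `scale`.

    Assumption: half-width = 1.96 * sample standard deviation / sqrt(n).
    """
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return scale * mean, scale * half_width

mean, half_width = summarise(mmlu_no_accuracy)
print(f"test_accuracy = {mean:.4f}, test_accuracy_se = {half_width:.4f}")
```

Under that assumption the result agrees with the reported `test_accuracy` (67.7587890625) and `test_accuracy_se` (0.6918996916280645) to the printed precision. The `norquad` and `speed` raw values are already on the reported scale, so `scale=1.0` would apply there.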

@Mikeriess Added the Norwegian results now 🙂

saattrupdan • Sep 26 '24 10:09

@saattrupdan does this look right to you? Are these all the benchmarks? :-)

Edit: This must be correct, as the run finally stopped by itself. What confused me was that the VRAM was only cleared some time after the last results were produced.


{"dataset": "swerec", "task": "sentiment-classification", "dataset_languages": ["sv"], "model": "meta-llama/Llama-3.3-70B-Instruct", "results": {"raw": [{"mcc": 0.7930893855596791, "macro_f1": 0.7672619047619048}, {"mcc": 0.8031605176646656, "macro_f1": 0.7968459755693799}, {"mcc": 0.7949760668429149, "macro_f1": 0.780081661969637}, {"mcc": 0.7648144704303965, "macro_f1": 0.7835589396503102}, {"mcc": 0.8677960800399562, "macro_f1": 0.8697172811092949}, {"mcc": 0.7253499508215652, "macro_f1": 0.7657273501023161}, {"mcc": 0.7405892601791131, "macro_f1": 0.7575159177438238}, {"mcc": 0.7934806108064297, "macro_f1": 0.8176691816961316}, {"mcc": 0.8396052039639403, "macro_f1": 0.8058667449225388}, {"mcc": 0.8102253952020522, "macro_f1": 0.7934153787037829}], "total": {"test_mcc": 79.33086941510713, "test_mcc_se": 2.636706245086354, "test_macro_f1": 79.37660336229119, "test_macro_f1_se": 2.023239794742445}}, "num_model_parameters": 70553706496, "max_sequence_length": 131072, "vocabulary_size": 128256, "generative": true, "few_shot": true, "validation_split": true, "scandeval_version": "14.0.3"}
{"dataset": "swerec", "task": "sentiment-classification", "dataset_languages": ["sv"], "model": "meta-llama/Llama-3.3-70B-Instruct", "results": {"raw": [{"mcc": 0.8202904742852445, "macro_f1": 0.8102835209379565}, {"mcc": 0.8048726863426076, "macro_f1": 0.7865724875105843}, {"mcc": 0.8216386170053912, "macro_f1": 0.8130688249851704}, {"mcc": 0.79451804158918, "macro_f1": 0.7893380563307625}, {"mcc": 0.8173987376370678, "macro_f1": 0.8267919825658266}, {"mcc": 0.8239806649146575, "macro_f1": 0.8172908607380079}, {"mcc": 0.8092063357590406, "macro_f1": 0.7774720624557404}, {"mcc": 0.8163682074979931, "macro_f1": 0.8136247222521206}, {"mcc": 0.8117184404274859, "macro_f1": 0.7989700747595485}, {"mcc": 0.7957097928293555, "macro_f1": 0.7930126092063553}], "total": {"test_mcc": 81.15701998288024, "test_mcc_se": 0.6471706807711762, "test_macro_f1": 80.26425201742073, "test_macro_f1_se": 0.9822841338378447}}, "num_model_parameters": 70553706496, "max_sequence_length": 131072, "vocabulary_size": 128256, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "swerec", "task": "sentiment-classification", "dataset_languages": ["sv"], "model": "google/gemma-2-27b-it", "results": {"raw": [{"mcc": 0.8017507831519932, "macro_f1": 0.8011377345148412}, {"mcc": 0.7918663676858606, "macro_f1": 0.7815195277694044}, {"mcc": 0.7961165687002842, "macro_f1": 0.7966713532508637}, {"mcc": 0.7821206259380749, "macro_f1": 0.7826672564484474}, {"mcc": 0.7970698194099559, "macro_f1": 0.810505255784615}, {"mcc": 0.8053682935217332, "macro_f1": 0.8071882736964343}, {"mcc": 0.803161990092723, "macro_f1": 0.7804232219658775}, {"mcc": 0.784993336424854, "macro_f1": 0.7948313852587914}, {"mcc": 0.7925049976775663, "macro_f1": 0.7834595473266003}, {"mcc": 0.7732901025516595, "macro_f1": 0.7820558972399322}], "total": {"test_mcc": 79.28242885154705, "test_mcc_se": 0.6304792819077044, "test_macro_f1": 79.20459453255808, "test_macro_f1_se": 0.7124719723271653}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "swerec", "task": "sentiment-classification", "dataset_languages": ["sv"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.8239937843212695, "macro_f1": 0.8125677917163253}, {"mcc": 0.7885882874420214, "macro_f1": 0.7355518640014477}, {"mcc": 0.8215669219145454, "macro_f1": 0.8050186961209045}, {"mcc": 0.787375090232704, "macro_f1": 0.7545337323442016}, {"mcc": 0.8020063455999195, "macro_f1": 0.8008867411289396}, {"mcc": 0.822917271997569, "macro_f1": 0.8046223327223592}, {"mcc": 0.8117402036802612, "macro_f1": 0.7661512461608299}, {"mcc": 0.8275198363790907, "macro_f1": 0.8248434816600801}, {"mcc": 0.8097936943621004, "macro_f1": 0.766831577951712}, {"mcc": 0.796939447530873, "macro_f1": 0.7769492671923408}], "total": {"test_mcc": 80.92440883460355, "test_mcc_se": 0.9252458664179304, "test_macro_f1": 78.47956730999141, "test_macro_f1_se": 1.7902685675096426}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "angry-tweets", "task": "sentiment-classification", "dataset_languages": ["da"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.5495519698453366, "macro_f1": 0.697808259054681}, {"mcc": 0.5375440090128434, "macro_f1": 0.6711060922679231}, {"mcc": 0.5476660502804629, "macro_f1": 0.693371359883974}, {"mcc": 0.5546997970000596, "macro_f1": 0.697519461792139}, {"mcc": 0.547812057833584, "macro_f1": 0.6685237832581093}, {"mcc": 0.6048512578793241, "macro_f1": 0.7377825946646288}, {"mcc": 0.5497968387386895, "macro_f1": 0.6904415645861608}, {"mcc": 0.48967313328133427, "macro_f1": 0.6346188298001039}, {"mcc": 0.5294142358533404, "macro_f1": 0.6675985544039618}, {"mcc": 0.5401629967713221, "macro_f1": 0.6855736170175845}], "total": {"test_mcc": 54.51172346496297, "test_mcc_se": 1.7430455697317222, "test_macro_f1": 68.44344116729266, "test_macro_f1_se": 1.6668568196746847}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "norec", "task": "sentiment-classification", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.6785755169056348, "macro_f1": 0.7822670175869985}, {"mcc": 0.6648365884734226, "macro_f1": 0.7732291849382741}, {"mcc": 0.5685065924878293, "macro_f1": 0.6736409600787651}, {"mcc": 0.6456842484170363, "macro_f1": 0.7572962916758462}, {"mcc": 0.6064035837768045, "macro_f1": 0.7438960844470709}, {"mcc": 0.5951370021534371, "macro_f1": 0.7273119544751682}, {"mcc": 0.6383221925214325, "macro_f1": 0.7516706749717019}, {"mcc": 0.603263250438851, "macro_f1": 0.7066683702490985}, {"mcc": 0.6438224186776832, "macro_f1": 0.7530088819047438}, {"mcc": 0.6205528314118222, "macro_f1": 0.7181956364778302}], "total": {"test_mcc": 62.651042252639535, "test_mcc_se": 2.0983570008940267, "test_macro_f1": 73.87185056805498, "test_macro_f1_se": 2.0290150569104566}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "hotter-and-colder-sentiment", "task": "sentiment-classification", "dataset_languages": ["is"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.4971253789849931, "macro_f1": 0.6574999950328407}, {"mcc": 0.49503006372513897, "macro_f1": 0.6630530597157519}, {"mcc": 0.4568990951550783, "macro_f1": 0.6321287330730718}, {"mcc": 0.4089225379783487, "macro_f1": 0.5975545782205981}, {"mcc": 0.47572149392704716, "macro_f1": 0.6306175209399987}, {"mcc": 0.5198181332119355, "macro_f1": 0.6721286457127728}, {"mcc": 0.503360970358549, "macro_f1": 0.6639834045624239}, {"mcc": 0.45102388505733915, "macro_f1": 0.6185026078842082}, {"mcc": 0.49521697257200387, "macro_f1": 0.6511484568546572}, {"mcc": 0.4821912588550165, "macro_f1": 0.6478460353076353}], "total": {"test_mcc": 47.853097898254504, "test_mcc_se": 1.9933365950630555, "test_macro_f1": 64.34463037303959, "test_macro_f1_se": 1.4500416051200207}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "sb10k", "task": "sentiment-classification", "dataset_languages": ["de"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.48312245997564723, "macro_f1": 0.6119704277315609}, {"mcc": 0.6215888697883776, "macro_f1": 0.7424625699061848}, {"mcc": 0.631671136027995, "macro_f1": 0.7529450174993894}, {"mcc": 0.5672731801803156, "macro_f1": 0.6547230067291349}, {"mcc": 0.6861047296898484, "macro_f1": 0.7869013185324162}, {"mcc": 0.6502939492585968, "macro_f1": 0.7568704500141692}, {"mcc": 0.611257805369223, "macro_f1": 0.7338902188060324}, {"mcc": 0.5638281621123793, "macro_f1": 0.6866138258597286}, {"mcc": 0.6688151715966171, "macro_f1": 0.7777940225856727}, {"mcc": 0.6843464345729227, "macro_f1": 0.7898384357038029}], "total": {"test_mcc": 61.683018985719215, "test_mcc_se": 3.9523419336658656, "test_macro_f1": 72.94009293368092, "test_macro_f1_se": 3.695695270958215}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "dutch-social", "task": "sentiment-classification", "dataset_languages": ["nl"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.18473283321410675, "macro_f1": 0.4334037318263675}, {"mcc": 0.20854735180376877, "macro_f1": 0.44458331047995614}, {"mcc": 0.15316920082856625, "macro_f1": 0.40634862142183814}, {"mcc": 0.1622548933951797, "macro_f1": 0.2875532644258724}, {"mcc": 0.14807962971826985, "macro_f1": 0.4081924359517477}, {"mcc": 0.11387393640506001, "macro_f1": 0.3422810458239365}, {"mcc": 0.13719883080889642, "macro_f1": 0.3548432724803903}, {"mcc": 0.17031586518791697, "macro_f1": 0.4019768089270364}, {"mcc": 0.05282216321453481, "macro_f1": 0.31438826666120845}, {"mcc": 0.1336347369024637, "macro_f1": 0.3521812377996165}], "total": {"test_mcc": 14.646294414787633, "test_mcc_se": 2.6349686700097164, "test_macro_f1": 37.457519957979706, "test_macro_f1_se": 3.226003863044401}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "sst5", "task": "sentiment-classification", "dataset_languages": ["en"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.6875966361621771, "macro_f1": 0.6741341928034358}, {"mcc": 0.6787887323637081, "macro_f1": 0.6378456956599218}, {"mcc": 0.703661735850759, "macro_f1": 0.6972215199185232}, {"mcc": 0.6878818307230166, "macro_f1": 0.6634345438768979}, {"mcc": 0.5685082086115124, "macro_f1": 0.6574958398396383}, {"mcc": 0.6763900452820425, "macro_f1": 0.6906949817450941}, {"mcc": 0.6926389361501009, "macro_f1": 0.7089989349458419}, {"mcc": 0.6842415520666201, "macro_f1": 0.6686920383766973}, {"mcc": 0.686756128568964, "macro_f1": 0.6290847631786985}, {"mcc": 0.7014484382314222, "macro_f1": 0.7159438980006657}], "total": {"test_mcc": 67.67912244010323, "test_mcc_se": 2.418664372604858, "test_macro_f1": 67.43546408345415, "test_macro_f1_se": 1.7937112380620517}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "fosent", "task": "sentiment-classification", "dataset_languages": ["fo"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.6363547958650176, "macro_f1": 0.7514369814711066}, {"mcc": 0.5819830672939563, "macro_f1": 0.7060279179142229}, {"mcc": 0.5856711346206503, "macro_f1": 0.656215907769199}, {"mcc": 0.4941539417412506, "macro_f1": 0.6533065699732367}, {"mcc": 0.33454580525054733, "macro_f1": 0.5422853857968362}, {"mcc": 0.5648839514402847, "macro_f1": 0.7011880297451837}, {"mcc": 0.486199310073819, "macro_f1": 0.6633085382280833}, {"mcc": 0.6009684786572725, "macro_f1": 0.7306335131303806}, {"mcc": 0.6114653859067226, "macro_f1": 0.717375019161174}, {"mcc": 0.4406287030489584, "macro_f1": 0.5826542064637302}], "total": {"test_mcc": 53.36854573898479, "test_mcc_se": 5.8165040262576735, "test_macro_f1": 67.04432069653153, "test_macro_f1_se": 4.094440304697722}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "suc3", "task": "named-entity-recognition", "dataset_languages": ["sv"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7125876257238647, "micro_f1": 0.472506636329162}, {"micro_f1_no_misc": 0.7531071548538796, "micro_f1": 0.6796583021890016}, {"micro_f1_no_misc": 0.7636363636363638, "micro_f1": 0.6268181818181818}, {"micro_f1_no_misc": 0.7220764071157773, "micro_f1": 0.6380111524163569}, {"micro_f1_no_misc": 0.7517220724767895, "micro_f1": 0.6843447220805712}, {"micro_f1_no_misc": 0.737146529562982, "micro_f1": 0.6032540675844805}, {"micro_f1_no_misc": 0.6988098870918523, "micro_f1": 0.6013164080865069}, {"micro_f1_no_misc": 0.6676470588235294, "micro_f1": 0.4376293508936971}, {"micro_f1_no_misc": 0.6440677966101694, "micro_f1": 0.5673575129533679}, {"micro_f1_no_misc": 0.7179487179487178, "micro_f1": 0.6505711318795432}], "total": {"test_micro_f1_no_misc": 71.68749613843926, "test_micro_f1_no_misc_se": 2.375642900410976, "test_micro_f1": 59.61467466230869, "test_micro_f1_se": 5.133102542723377}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "dansk", "task": "named-entity-recognition", "dataset_languages": ["da"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.6248269497000462, "micro_f1": 0.5303086649312012}, {"micro_f1_no_misc": 0.6772082878953108, "micro_f1": 0.4765739385065886}, {"micro_f1_no_misc": 0.6276422764227642, "micro_f1": 0.5314625850340136}, {"micro_f1_no_misc": 0.6635730858468678, "micro_f1": 0.5437821927888153}, {"micro_f1_no_misc": 0.6005535055350554, "micro_f1": 0.35356068204613844}, {"micro_f1_no_misc": 0.5850746268656716, "micro_f1": 0.4639175257731959}, {"micro_f1_no_misc": 0.5744075829383886, "micro_f1": 0.4749813293502614}, {"micro_f1_no_misc": 0.6673407482305358, "micro_f1": 0.5152257612880644}, {"micro_f1_no_misc": 0.6076479832372971, "micro_f1": 0.42510876320696084}, {"micro_f1_no_misc": 0.599791013584117, "micro_f1": 0.4318109230277866}], "total": {"test_micro_f1_no_misc": 62.28066060256053, "test_micro_f1_no_misc_se": 2.2302577804643784, "test_micro_f1": 47.46732365953026, "test_micro_f1_se": 3.685538099350944}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "norne-nb", "task": "named-entity-recognition", "dataset_languages": ["nb", "no"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7927644514353127, "micro_f1": 0.7652936021034181}, {"micro_f1_no_misc": 0.7915309446254072, "micro_f1": 0.7633231652610802}, {"micro_f1_no_misc": 0.7465293327362293, "micro_f1": 0.6111752101532882}, {"micro_f1_no_misc": 0.8149111400086692, "micro_f1": 0.7863908757007539}, {"micro_f1_no_misc": 0.7666806546370122, "micro_f1": 0.7074067588863597}, {"micro_f1_no_misc": 0.7547092547092547, "micro_f1": 0.7078530259365994}, {"micro_f1_no_misc": 0.7379248658318426, "micro_f1": 0.6339489885664028}, {"micro_f1_no_misc": 0.7343612334801763, "micro_f1": 0.6966207287143665}, {"micro_f1_no_misc": 0.791372707115501, "micro_f1": 0.760821716801174}, {"micro_f1_no_misc": 0.7710947641713545, "micro_f1": 0.7314306622758873}], "total": {"test_micro_f1_no_misc": 77.0187934875076, "test_micro_f1_no_misc_se": 1.6722121640646848, "test_micro_f1": 71.6426473439933, "test_micro_f1_se": 3.584762519471857}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "norne-nn", "task": "named-entity-recognition", "dataset_languages": ["nn"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7583087512291052, "micro_f1": 0.7541204170871175}, {"micro_f1_no_misc": 0.7325512030224697, "micro_f1": 0.6289846517119244}, {"micro_f1_no_misc": 0.7488884593079451, "micro_f1": 0.6983115641924179}, {"micro_f1_no_misc": 0.7910366186919293, "micro_f1": 0.7902566617623018}, {"micro_f1_no_misc": 0.7318982387475538, "micro_f1": 0.6956668923493569}, {"micro_f1_no_misc": 0.7893052155902843, "micro_f1": 0.5998908296943232}, {"micro_f1_no_misc": 0.7733285663210583, "micro_f1": 0.7573228604156439}, {"micro_f1_no_misc": 0.7057313943541489, "micro_f1": 0.6478613569321534}, {"micro_f1_no_misc": 0.7673469387755103, "micro_f1": 0.6891294404634852}, {"micro_f1_no_misc": 0.7800427101533682, "micro_f1": 0.7127710080708972}], "total": {"test_micro_f1_no_misc": 75.78438096193373, "test_micro_f1_no_misc_se": 1.7314526389327531, "test_micro_f1": 69.74315682679621, "test_micro_f1_se": 3.717781526165303}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "mim-gold-ner", "task": "named-entity-recognition", "dataset_languages": ["is"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7192164179104477, "micro_f1": 0.675339366515837}, {"micro_f1_no_misc": 0.6239054290718038, "micro_f1": 0.4509995805955543}, {"micro_f1_no_misc": 0.708634828750603, "micro_f1": 0.6209833526906698}, {"micro_f1_no_misc": 0.7112221528211535, "micro_f1": 0.633434038267875}, {"micro_f1_no_misc": 0.6767905711695377, "micro_f1": 0.6429078014184396}, {"micro_f1_no_misc": 0.6135389888603257, "micro_f1": 0.584643179765131}, {"micro_f1_no_misc": 0.687880205896116, "micro_f1": 0.6420341676599125}, {"micro_f1_no_misc": 0.698364008179959, "micro_f1": 0.5912231559290384}, {"micro_f1_no_misc": 0.6965676984628343, "micro_f1": 0.6023942537909018}, {"micro_f1_no_misc": 0.6707677165354331, "micro_f1": 0.5540658192582274}], "total": {"test_micro_f1_no_misc": 68.06888017658214, "test_micro_f1_no_misc_se": 2.2310146831073556, "test_micro_f1": 59.980247158915866, "test_micro_f1_se": 3.8911135796122944}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "fone", "task": "named-entity-recognition", "dataset_languages": ["fo"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7640385165411383, "micro_f1": 0.7281639928698752}, {"micro_f1_no_misc": 0.7619511245253626, "micro_f1": 0.7554082665930222}, {"micro_f1_no_misc": 0.764535410251766, "micro_f1": 0.7093855391955252}, {"micro_f1_no_misc": 0.6831155253528489, "micro_f1": 0.6713596914175506}, {"micro_f1_no_misc": 0.764981553494865, "micro_f1": 0.7567161375164382}, {"micro_f1_no_misc": 0.7755775577557756, "micro_f1": 0.758}, {"micro_f1_no_misc": 0.7118226600985221, "micro_f1": 0.7101407641482333}, {"micro_f1_no_misc": 0.7949482895783612, "micro_f1": 0.7774407582938389}, {"micro_f1_no_misc": 0.716816003334028, "micro_f1": 0.7212671431330887}, {"micro_f1_no_misc": 0.807099668422079, "micro_f1": 0.793080878915381}], "total": {"test_micro_f1_no_misc": 75.44886309354746, "test_micro_f1_no_misc_se": 2.404169777852516, "test_micro_f1": 73.80963172082954, "test_micro_f1_se": 2.268423393005984}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "germeval", "task": "named-entity-recognition", "dataset_languages": ["de"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7025109170305677, "micro_f1": 0.6427579824155484}, {"micro_f1_no_misc": 0.7484472049689441, "micro_f1": 0.6912336919203479}, {"micro_f1_no_misc": 0.7047772410091251, "micro_f1": 0.6458511548331908}, {"micro_f1_no_misc": 0.6762095619811052, "micro_f1": 0.6055427251732103}, {"micro_f1_no_misc": 0.632340179346254, "micro_f1": 0.5904673762147153}, {"micro_f1_no_misc": 0.7217478653942743, "micro_f1": 0.6720879120879121}, {"micro_f1_no_misc": 0.7071509648127129, "micro_f1": 0.6737438075017693}, {"micro_f1_no_misc": 0.705611510791367, "micro_f1": 0.6396270396270396}, {"micro_f1_no_misc": 0.7004081632653062, "micro_f1": 0.627630375114364}, {"micro_f1_no_misc": 0.6994535519125683, "micro_f1": 0.6258699304055675}], "total": {"test_micro_f1_no_misc": 69.98657160512225, "test_micro_f1_no_misc_se": 1.8566596009559497, "test_micro_f1": 64.14811995293664, "test_micro_f1_se": 1.9388334560673293}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "conll-nl", "task": "named-entity-recognition", "dataset_languages": ["nl"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7256038647342996, "micro_f1": 0.62460345435319}, {"micro_f1_no_misc": 0.6973833902161548, "micro_f1": 0.6548069644208934}, {"micro_f1_no_misc": 0.707182320441989, "micro_f1": 0.6092032967032968}, {"micro_f1_no_misc": 0.6896156052782559, "micro_f1": 0.6735504368546466}, {"micro_f1_no_misc": 0.7084577114427861, "micro_f1": 0.6458827516096238}, {"micro_f1_no_misc": 0.7501287995878412, "micro_f1": 0.627254509018036}, {"micro_f1_no_misc": 0.7701333333333333, "micro_f1": 0.6571224051539012}, {"micro_f1_no_misc": 0.7154471544715448, "micro_f1": 0.6553359683794466}, {"micro_f1_no_misc": 0.7264260768335273, "micro_f1": 0.5896700143472022}, {"micro_f1_no_misc": 0.7373974208675264, "micro_f1": 0.635646032405484}], "total": {"test_micro_f1_no_misc": 72.27775677207259, "test_micro_f1_no_misc_se": 1.5283368847511172, "test_micro_f1": 63.730758332457214, "test_micro_f1_se": 1.5646433051805895}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "conll-en", "task": "named-entity-recognition", "dataset_languages": ["en"], "model": "google/gemma-2-27b", "results": {"raw": [{"micro_f1_no_misc": 0.7821112844513781, "micro_f1": 0.7613572101790763}, {"micro_f1_no_misc": 0.769151881858271, "micro_f1": 0.7379641144075324}, {"micro_f1_no_misc": 0.8019533369506241, "micro_f1": 0.7773556935050495}, {"micro_f1_no_misc": 0.7873369793857803, "micro_f1": 0.7510134816630527}, {"micro_f1_no_misc": 0.7636946075814202, "micro_f1": 0.7320058280718796}, {"micro_f1_no_misc": 0.7950069945119983, "micro_f1": 0.7660741301059002}, {"micro_f1_no_misc": 0.6025539608995366, "micro_f1": 0.622500467202392}, {"micro_f1_no_misc": 0.7651378099393423, "micro_f1": 0.7334945136693323}, {"micro_f1_no_misc": 0.7785373938263932, "micro_f1": 0.7547312641937927}, {"micro_f1_no_misc": 0.754203197524497, "micro_f1": 0.7407341569638255}], "total": {"test_micro_f1_no_misc": 75.99687446929241, "test_micro_f1_no_misc_se": 3.5492347009784653, "test_micro_f1": 73.77230859961834, "test_micro_f1_se": 2.672680722311093}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-sv", "task": "linguistic-acceptability", "dataset_languages": ["sv"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.5889921948975404, "macro_f1": 0.7912755891044387}, {"mcc": 0.5612103270108691, "macro_f1": 0.7806034864297986}, {"mcc": 0.638906761144078, "macro_f1": 0.8190017580065734}, {"mcc": 0.5681333481848871, "macro_f1": 0.7812270656754219}, {"mcc": 0.5928504652915157, "macro_f1": 0.7909753354593212}, {"mcc": 0.5943517068783347, "macro_f1": 0.7971330047215174}, {"mcc": 0.5826192636922987, "macro_f1": 0.7721913236929923}, {"mcc": 0.6149516221077359, "macro_f1": 0.8050139554927517}, {"mcc": 0.6105575313306159, "macro_f1": 0.804198798606202}, {"mcc": 0.5944658486746895, "macro_f1": 0.7957868312711935}], "total": {"test_mcc": 59.470390692125655, "test_mcc_se": 1.4050908748612045, "test_macro_f1": 79.3740714846021, "test_macro_f1_se": 0.8512285511795253}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-da", "task": "linguistic-acceptability", "dataset_languages": ["da"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.5312071141081297, "macro_f1": 0.7655677655677655}, {"mcc": 0.5060093890300942, "macro_f1": 0.747420349434738}, {"mcc": 0.45548895004365986, "macro_f1": 0.6968340402026325}, {"mcc": 0.5059875230284765, "macro_f1": 0.7329951524776883}, {"mcc": 0.46617764203019846, "macro_f1": 0.7078239107497282}, {"mcc": 0.5332733732906881, "macro_f1": 0.7651510807866522}, {"mcc": 0.5379603663608457, "macro_f1": 0.7635115257274552}, {"mcc": 0.4926802707950223, "macro_f1": 0.7262635275757625}, {"mcc": 0.5506545352050426, "macro_f1": 0.7689009809014264}, {"mcc": 0.5554544766388321, "macro_f1": 0.7777047799054511}], "total": {"test_mcc": 51.3489364053099, "test_mcc_se": 2.1265141252691797, "test_macro_f1": 74.521731133293, "test_macro_f1_se": 1.7304349575093536}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-nb", "task": "linguistic-acceptability", "dataset_languages": ["nb", "no"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.5995091142338563, "macro_f1": 0.7976893012068892}, {"mcc": 0.5435840583176177, "macro_f1": 0.7688927861653077}, {"mcc": 0.5786693784960071, "macro_f1": 0.7765878887225559}, {"mcc": 0.5792315603049804, "macro_f1": 0.7800007129321067}, {"mcc": 0.3193817778215631, "macro_f1": 0.5463388075392741}, {"mcc": 0.6100461478179665, "macro_f1": 0.7962778574421707}, {"mcc": 0.6145968361693808, "macro_f1": 0.8035580889414595}, {"mcc": 0.5990177545042203, "macro_f1": 0.7911330049261084}, {"mcc": 0.54083541039275, "macro_f1": 0.7523679013340856}, {"mcc": 0.5460165688336591, "macro_f1": 0.7722522109744515}], "total": {"test_mcc": 55.30888606892, "test_mcc_se": 5.369230575793629, "test_macro_f1": 75.85098560184409, "test_macro_f1_se": 4.720064132726272}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-nn", "task": "linguistic-acceptability", "dataset_languages": ["nn"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.22080921404838819, "macro_f1": 0.48556334656866884}, {"mcc": 0.4641889284994848, "macro_f1": 0.7277858474584562}, {"mcc": 0.4996017599586461, "macro_f1": 0.7444550665545586}, {"mcc": 0.42434402251244147, "macro_f1": 0.7080834452502462}, {"mcc": 0.41952545907731814, "macro_f1": 0.709356171542254}, {"mcc": 0.4067569748675894, "macro_f1": 0.7033501066558042}, {"mcc": 0.5293461782638781, "macro_f1": 0.76425184273776}, {"mcc": 0.43858992035699906, "macro_f1": 0.7149691126954691}, {"mcc": 0.46001963695892223, "macro_f1": 0.7299804043723117}, {"mcc": 0.4347050614579048, "macro_f1": 0.7097711679124525}], "total": {"test_mcc": 42.97887156001572, "test_mcc_se": 5.116160547536689, "test_macro_f1": 69.9756651174798, "test_macro_f1_se": 4.811726578439137}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-is", "task": "linguistic-acceptability", "dataset_languages": ["is"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.19186928276050666, "macro_f1": 0.5760280701754386}, {"mcc": 0.1987207609548552, "macro_f1": 0.5903579981059578}, {"mcc": 0.09675031799148327, "macro_f1": 0.40088298048407367}, {"mcc": 0.18385507891838584, "macro_f1": 0.5729845744505458}, {"mcc": 0.16095988840992476, "macro_f1": 0.5515626209152068}, {"mcc": 0.1703097698148867, "macro_f1": 0.5848595848595849}, {"mcc": 0.1352195511371179, "macro_f1": 0.5512814439338671}, {"mcc": 0.10343902265678158, "macro_f1": 0.49695052290717256}, {"mcc": 0.14518128862758176, "macro_f1": 0.4932092507143718}, {"mcc": 0.15248450035245617, "macro_f1": 0.5358567097957408}], "total": {"test_mcc": 15.3878946162398, "test_mcc_se": 2.1576195654242136, "test_macro_f1": 53.5397375634196, "test_macro_f1_se": 3.602040724314056}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-fo", "task": "linguistic-acceptability", "dataset_languages": ["fo"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.1384153768162299, "macro_f1": 0.5275735219397191}, {"mcc": 0.10776334358916206, "macro_f1": 0.49270596171683956}, {"mcc": 0.010344520084234973, "macro_f1": 0.3387935429056924}, {"mcc": 0.11356306882366897, "macro_f1": 0.5352482431498418}, {"mcc": 0.14607818540103526, "macro_f1": 0.5641021635554168}, {"mcc": 0.1312707430169637, "macro_f1": 0.48529138761538854}, {"mcc": 0.1412376440018958, "macro_f1": 0.5694444444444444}, {"mcc": 0.0664244965389879, "macro_f1": 0.533139011399881}, {"mcc": 0.12376997472716825, "macro_f1": 0.5046214835839937}, {"mcc": 0.10661114769224578, "macro_f1": 0.5507195507195507}], "total": {"test_mcc": 10.854785006915927, "test_mcc_se": 2.5775677093440623, "test_macro_f1": 51.016393110307675, "test_macro_f1_se": 4.12430834743172}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-de", "task": "linguistic-acceptability", "dataset_languages": ["de"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.47352546560893893, "macro_f1": 0.731551824115649}, {"mcc": 0.33705877544388796, "macro_f1": 0.6106426411493507}, {"mcc": 0.49293768038027536, "macro_f1": 0.7321428571428572}, {"mcc": 0.4362158894662926, "macro_f1": 0.6794587107357832}, {"mcc": 0.4097326380360321, "macro_f1": 0.676088617265088}, {"mcc": 0.46871891261349946, "macro_f1": 0.7178043195742311}, {"mcc": 0.23926370094198135, "macro_f1": 0.6143375086209222}, {"mcc": 0.4881039666129266, "macro_f1": 0.7425161086078856}, {"mcc": 0.4289050409562341, "macro_f1": 0.6965056361338586}, {"mcc": 0.4522855500722244, "macro_f1": 0.7169444412227493}], "total": {"test_mcc": 42.26747620132293, "test_mcc_se": 4.896074363890806, "test_macro_f1": 69.17992664568375, "test_macro_f1_se": 2.9288429314083224}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-nl", "task": "linguistic-acceptability", "dataset_languages": ["nl"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.5263514531182735, "macro_f1": 0.7568410022544246}, {"mcc": 0.5314174022099742, "macro_f1": 0.7644447622096981}, {"mcc": 0.5193833237960342, "macro_f1": 0.7455900621118012}, {"mcc": 0.5650198815464519, "macro_f1": 0.7745645988681866}, {"mcc": 0.5243623611241239, "macro_f1": 0.755996999063471}, {"mcc": 0.5518719601966261, "macro_f1": 0.7654159387663215}, {"mcc": 0.5505200202648795, "macro_f1": 0.7644140780993486}, {"mcc": 0.5440378121661968, "macro_f1": 0.7714599768930881}, {"mcc": 0.5757035461457337, "macro_f1": 0.7853960282341412}, {"mcc": 0.5095571634210267, "macro_f1": 0.7500699490157388}], "total": {"test_mcc": 53.98224923989321, "test_mcc_se": 1.3106840419528871, "test_macro_f1": 76.3419339551622, "test_macro_f1_se": 0.7369431816853134}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scala-en", "task": "linguistic-acceptability", "dataset_languages": ["en"], "model": "google/gemma-2-27b", "results": {"raw": [{"mcc": 0.5242477676329164, "macro_f1": 0.7604752266101644}, {"mcc": 0.5014663773939598, "macro_f1": 0.7465993859590323}, {"mcc": 0.41081255484095464, "macro_f1": 0.6976106386089386}, {"mcc": 0.4834021718720495, "macro_f1": 0.7238723744795351}, {"mcc": 0.4508415692097128, "macro_f1": 0.7119987231777303}, {"mcc": 0.42835307259691124, "macro_f1": 0.6574686693743577}, {"mcc": 0.475747523557543, "macro_f1": 0.7299991242985968}, {"mcc": 0.5164005418059514, "macro_f1": 0.7551086267504178}, {"mcc": 0.49610389071750266, "macro_f1": 0.7379053094071479}, {"mcc": 0.46415292460147645, "macro_f1": 0.72971711646233}], "total": {"test_mcc": 47.51528394228978, "test_mcc_se": 2.296293826481112, "test_macro_f1": 72.50755195128251, "test_macro_f1_se": 1.8852090936290922}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scandiqa-da", "task": "reading-comprehension", "dataset_languages": ["da"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 15.303030303030303, "f1": 21.356050156518073}, {"em": 15.5420773313116, "f1": 19.695345479091973}, {"em": 16.317016317016318, "f1": 19.685991839101888}, {"em": 16.133942161339423, "f1": 20.865168656953724}, {"em": 16.499614494988435, "f1": 19.52937792953213}, {"em": 15.572519083969466, "f1": 20.17517182555115}, {"em": 14.508580343213728, "f1": 20.8722478695564}, {"em": 15.330188679245284, "f1": 23.551482304877688}, {"em": 15.944272445820433, "f1": 19.457774337805297}, {"em": 14.737654320987655, "f1": 19.5968058005095}], "total": {"test_em": 15.588889548092265, "test_em_se": 0.4041033450194408, "test_f1": 20.478541619949784, "test_f1_se": 0.7886513006355999}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "norquad", "task": "reading-comprehension", "dataset_languages": ["nb", "nn", "no"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 8.49673202614379, "f1": 39.98907985271517}, {"em": 6.504065040650406, "f1": 37.84617682297176}, {"em": 8.076602830974188, "f1": 33.503625540453534}, {"em": 7.76778413736713, "f1": 38.2627775037928}, {"em": 6.368899917287014, "f1": 38.34197493634812}, {"em": 6.666666666666667, "f1": 31.6468189134201}, {"em": 6.050420168067227, "f1": 32.17997735130463}, {"em": 6.925675675675675, "f1": 30.37378657060356}, {"em": 8.305647840531561, "f1": 27.557260700542493}, {"em": 5.960264900662252, "f1": 37.65823171461238}], "total": {"test_em": 7.112275920402591, "test_em_se": 0.5956112645139311, "test_f1": 34.73597099067645, "test_f1_se": 2.608485538975757}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "scandiqa-sv", "task": "reading-comprehension", "dataset_languages": ["sv"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 13.181818181818182, "f1": 16.883838383838384}, {"em": 11.599696739954512, "f1": 24.754728833236136}, {"em": 14.918414918414918, "f1": 19.13173419000925}, {"em": 15.296803652968036, "f1": 20.45305888990949}, {"em": 13.030069390902081, "f1": 20.320256791298572}, {"em": 13.206106870229007, "f1": 17.013994910941477}, {"em": 13.962558502340094, "f1": 20.93602831805766}, {"em": 14.30817610062893, "f1": 18.91011255259314}, {"em": 13.854489164086687, "f1": 18.850180726840144}, {"em": 13.657407407407407, "f1": 17.70431498867171}], "total": {"test_em": 13.701554092874986, "test_em_se": 0.6491159282476638, "test_f1": 19.495824858539596, "test_f1_se": 1.4374844849533637}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "nqii", "task": "reading-comprehension", "dataset_languages": ["is"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 13.384615384615385, "f1": 27.705276817722794}, {"em": 10.43613707165109, "f1": 22.29829924011203}, {"em": 12.77258566978193, "f1": 23.99210268168778}, {"em": 10.911808669656203, "f1": 24.77566227109337}, {"em": 11.442006269592477, "f1": 26.714530496371097}, {"em": 12.788906009244993, "f1": 23.718855583197623}, {"em": 12.923076923076923, "f1": 27.174069109473045}, {"em": 11.042944785276074, "f1": 20.17183581741789}, {"em": 11.295180722891565, "f1": 28.783491050925498}, {"em": 13.403614457831326, "f1": 22.768743960007562}], "total": {"test_em": 12.040087596361797, "test_em_se": 0.6943942199858668, "test_f1": 24.81028670280087, "test_f1_se": 1.6940869395850615}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "foqa", "task": "reading-comprehension", "dataset_languages": ["fo"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 16.91176470588235, "f1": 36.54623305073573}, {"em": 19.9017199017199, "f1": 37.16462288300298}, {"em": 18.271604938271604, "f1": 43.17278669255752}, {"em": 18.932038834951456, "f1": 41.009195555169306}, {"em": 16.08910891089109, "f1": 44.276740042663235}, {"em": 18.02469135802469, "f1": 37.19336904397609}, {"em": 16.9811320754717, "f1": 40.98086916239511}, {"em": 17.8117048346056, "f1": 47.35419235291609}, {"em": 18.453865336658353, "f1": 39.534661243143276}, {"em": 18.181818181818183, "f1": 45.16963346991129}], "total": {"test_em": 17.95594490782949, "test_em_se": 0.6754520306582787, "test_f1": 41.240230349647064, "test_f1_se": 2.2990166891391906}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "germanquad", "task": "reading-comprehension", "dataset_languages": ["de"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 13.257575757575758, "f1": 29.39861138171179}, {"em": 10.235026535253981, "f1": 26.716513239099918}, {"em": 10.722610722610723, "f1": 32.84903118114236}, {"em": 13.013698630136986, "f1": 32.65657864860939}, {"em": 12.41326137239784, "f1": 34.39668258788835}, {"em": 11.603053435114504, "f1": 28.959630874012042}, {"em": 12.090483619344774, "f1": 26.38707909399387}, {"em": 10.849056603773585, "f1": 31.55043119946519}, {"em": 12.61609907120743, "f1": 25.909421302506875}, {"em": 12.345679012345679, "f1": 28.196464262456338}], "total": {"test_em": 11.914654475976125, "test_em_se": 0.6348900712180469, "test_f1": 29.702044377088612, "test_f1_se": 1.8626917417142212}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "squad", "task": "reading-comprehension", "dataset_languages": ["en"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 13.93939393939394, "f1": 35.158984274670594}, {"em": 12.964366944655042, "f1": 37.904134355524704}, {"em": 15.22921522921523, "f1": 36.1293411926665}, {"em": 14.231354642313546, "f1": 37.30525720013603}, {"em": 14.726291441788744, "f1": 40.28389238091828}, {"em": 13.358778625954198, "f1": 23.973703055521007}, {"em": 13.650546021840874, "f1": 33.44553093189277}, {"em": 13.364779874213836, "f1": 40.751581656244525}, {"em": 12.693498452012383, "f1": 31.92970137708725}, {"em": 13.88888888888889, "f1": 28.835482343920223}], "total": {"test_em": 13.804711406027668, "test_em_se": 0.48200356390948995, "test_f1": 34.57176087685819, "test_f1_se": 3.2380483134949345}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "squad-nl", "task": "reading-comprehension", "dataset_languages": ["nl"], "model": "google/gemma-2-27b", "results": {"raw": [{"em": 15.606060606060606, "f1": 29.736126469717732}, {"em": 15.921152388172858, "f1": 28.975263498384145}, {"em": 16.93861693861694, "f1": 27.96141231154583}, {"em": 17.579908675799086, "f1": 28.571688724604037}, {"em": 16.962220508866615, "f1": 27.917771662326885}, {"em": 15.954198473282442, "f1": 24.420830640652596}, {"em": 16.770670826833072, "f1": 22.460146858725857}, {"em": 18.00314465408805, "f1": 29.65684635618931}, {"em": 15.092879256965944, "f1": 31.070310958649916}, {"em": 16.743827160493826, "f1": 26.72647993117561}], "total": {"test_em": 16.557267948917943, "test_em_se": 0.5598757032840742, "test_f1": 27.749687741197192, "test_f1_se": 1.6131912530141794}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
{"dataset": "nordjylland-news", "task": "summarization", "dataset_languages": ["da"], "model": "google/gemma-2-27b", "results": {"raw": [{"bertscore": 0.6725108948157867, "rouge_l": 0.22883554594257474}, {"bertscore": 0.6581661547679687, "rouge_l": 0.21190716789868058}, {"bertscore": 0.6149541602499085, "rouge_l": 0.19138548136613698}, {"bertscore": 0.6611860719422111, "rouge_l": 0.2092314363065303}, {"bertscore": 0.6682513057894539, "rouge_l": 0.21966671114081354}, {"bertscore": 0.6651086555793881, "rouge_l": 0.22074389123673033}, {"bertscore": 0.6437325084843906, "rouge_l": 0.20602858414451647}, {"bertscore": 0.6487583433045074, "rouge_l": 0.21156187946174226}, {"bertscore": 0.6561987650056835, "rouge_l": 0.2037092655043311}, {"bertscore": 0.6472448860731674, "rouge_l": 0.21450867681836894}], "total": {"test_bertscore": 65.36111746012466, "test_bertscore_se": 1.0214966173445834, "test_rouge_l": 21.175786398204256, "test_rouge_l_se": 0.6406648223218555}}, "num_model_parameters": 27227128320, "max_sequence_length": 4096, "vocabulary_size": 256000, "generative": true, "few_shot": true, "validation_split": false, "scandeval_version": "14.0.3"}
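As a sanity check on records like the ones above, the `total` entries can be recomputed from the per-iteration `raw` scores. The sketch below is not ScandEval's own code: the helper name `summarize` and its `scale` parameter are hypothetical. It assumes that each reported `_se` value is 1.96 times the standard error of the mean. That assumption agrees with the scala-sv record above, but it has not been checked against the library. It also assumes that MCC and F1 classification scores are multiplied by 100 in `total`, while the `em` and `f1` reading-comprehension scores are already on a 0-100 scale.

```python
import math
import statistics


def summarize(scores, scale=1.0):
    """Return (mean, 1.96 * standard error of the mean), both multiplied by scale."""
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator), then standard error.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean * scale, 1.96 * sem * scale


# Raw MCC values copied from the scala-sv record above.
scala_sv_mcc = [
    0.5889921948975404, 0.5612103270108691, 0.638906761144078,
    0.5681333481848871, 0.5928504652915157, 0.5943517068783347,
    0.5826192636922987, 0.6149516221077359, 0.6105575313306159,
    0.5944658486746895,
]

mean, half_width = summarize(scala_sv_mcc, scale=100)
print(f"test_mcc = {mean:.2f}, test_mcc_se = {half_width:.2f}")
```

For the reading-comprehension records, the same helper would be called with the default `scale=1.0`.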

Mikeriess avatar Dec 16 '24 17:12 Mikeriess

@Mikeriess Quite a few (language, task) combinations are missing there:

  • Missing knowledge and common-sense-reasoning tasks for all languages.
  • There's only a summarization task for Danish.
  • Missing NER evaluations for de, en and nl.

I've published the newest fixes in version 14.0.4, so you might also want to update to that version for future evaluations 🙂
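One way to spot missing (language, task) combinations before submitting is to tabulate which tasks each language has a record for. This is a minimal sketch, not part of ScandEval: the helper name `coverage` is hypothetical, and it relies only on the `task` and `dataset_languages` fields that appear in the result records in this thread. The sample records are abbreviated to those fields.

```python
import json
from collections import defaultdict


def coverage(jsonl_text):
    """Map each language code to the set of tasks that have a result record."""
    tasks_by_language = defaultdict(set)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for language in record["dataset_languages"]:
            tasks_by_language[language].add(record["task"])
    return dict(tasks_by_language)


# Abbreviated records: only the fields that coverage() reads.
sample = "\n".join([
    '{"dataset": "scala-da", "task": "linguistic-acceptability", "dataset_languages": ["da"]}',
    '{"dataset": "scandiqa-da", "task": "reading-comprehension", "dataset_languages": ["da"]}',
    '{"dataset": "nordjylland-news", "task": "summarization", "dataset_languages": ["da"]}',
    '{"dataset": "scala-sv", "task": "linguistic-acceptability", "dataset_languages": ["sv"]}',
])

result = coverage(sample)
for language in sorted(result):
    print(language, sorted(result[language]))
```

Comparing each language's set against the full list of expected tasks then shows the gaps. In this sample, `sv` has no summarization record.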

saattrupdan avatar Dec 17 '24 11:12 saattrupdan

@saattrupdan ah, bummer.. I'll give it another shot :-)

Mikeriess avatar Dec 17 '24 15:12 Mikeriess

Live on the leaderboards now 🎉

saattrupdan avatar Mar 28 '25 11:03 saattrupdan