tabzilla Which datasets are used for main paper (98 datasets) and "small data" (57 datasets)

Hi. I'm trying to compare against some of the results in your work, but it's not clear to me which datasets were used for Table 1 and Table 2. The Datasets A file contains 108 datasets and the Datasets B file contains 69, so I'm not sure which ones make up the 98. I mainly care about the 57 small datasets, but cutting off at 1250 or fewer instances doesn't yield 57 for A, B, or their combination.

Mar 27 '24 00:03 amueller

The "easy_import" list seems to contain 175 classification tasks, 69 of which have fewer than 1250 instances.

Mar 27 '24 00:03 amueller

Hi Andreas, below are the 98 datasets from Table 1 and the 57 datasets from Table 2. Please let us know if you have more questions.

datasets_table_1 = ['openml__visualizing_environmental__3602', 'openml__labor__4', 'openml__monks-problems-2__146065', 'openml__tic-tac-toe__49', 'openml__dermatology__35', 'openml__cardiotocography__9979', 'openml__lung-cancer__146024', 'openml__sonar__39', 'openml__anneal__2867', 'openml__analcatdata_chlamydia__3739', 'openml__iris__59', 'openml__irish__3543', 'openml__heart-c__48', 'openml__ionosphere__145984', 'openml__hayes-roth__146063', 'openml__fri_c3_100_5__3779', 'openml__fri_c0_100_5__3620', 'openml__analcatdata_authorship__3549', 'openml__rabe_266__3647', 'openml__balance-scale__11', 'openml__acute-inflammations__10089', 'openml__MiceProtein__146800', 'openml__banknote-authentication__10093', 'openml__mushroom__24', 'openml__kr-vs-kp__3', 'openml__analcatdata_boxing1__3540', 'openml__musk__3950', 'openml__transplant__3748', 'openml__cjs__14967', 'openml__synthetic_control__3512', 'openml__car-evaluation__146192', 'openml__fertility__9984', 'openml__postoperative-patient-data__146210', 'openml__breast-w__15', 'openml__wdbc__9946', 'openml__car__146821', 'openml__visualizing_livestock__3731', 'openml__mfeat-factors__12', 'openml__Satellite__167211', 'openml__colic__25', 'openml__lymph__10', 'openml__wall-robot-navigation__9960', 'openml__wilt__146820', 'openml__scene__3485', 'openml__mfeat-karhunen__16', 'openml__sick__3021', 'openml__dna__167140', 'openml__socmob__3797', 'openml__page-blocks__30', 'openml__PhishingWebsites__14952', 'openml__spambase__43', 'openml__splice__45', 'openml__churn__167141', 'openml__colic__27', 'openml__ecoli__145977', 'openml__semeion__9964', 'openml__ozone-level-8hr__9978', 'openml__heart-h__50', 'openml__pc1__3918', 'openml__qsar-biodeg__9957', 'openml__autos__9', 'openml__pc4__3902', 'openml__hill-valley__145847', 'openml__satimage__2074', 'openml__pc3__3903', 'openml__mfeat-fourier__14', 'openml__Australian__146818', 'openml__credit-approval__29', 'openml__cylinder-bands__14954', 'openml__mfeat-zernike__22', 
'openml__kc2__3913', 'openml__bank-marketing__14965', 'openml__phoneme__9952', 'openml__elevators__3711', 'openml__breast-cancer__145799', 'openml__SpeedDating__146607', 'openml__kc1__3917', 'openml__adult-census__3953', 'openml__ilpd__9971', 'openml__vehicle__53', 'openml__ada_agnostic__3896', 'openml__tae__47', 'openml__blood-transfusion-service-center__10101', 'openml__jasmine__168911', 'openml__LED-display-domain-7digit__125921', 'openml__diabetes__37', 'openml__Click_prediction_small__190408', 'openml__profb__3561', 'openml__steel-plates-fault__146817', 'openml__jm1__3904', 'openml__glass__40', 'openml__dresses-sales__125920', 'openml__mfeat-morphological__18', 'openml__eucalyptus__2079', 'openml__libras__360948', 'openml__yeast__145793', 'openml__cmc__23', 'openml__analcatdata_dmft__3560']

datasets_table_2 = ["openml__Australian__146818", "openml__LED-display-domain-7digit__125921", "openml__MiceProtein__146800", "openml__acute-inflammations__10089", "openml__analcatdata_authorship__3549", "openml__analcatdata_boxing1__3540", "openml__analcatdata_chlamydia__3739", "openml__analcatdata_dmft__3560", "openml__anneal__2867", "openml__autos__9", "openml__balance-scale__11", "openml__blood-transfusion-service-center__10101", "openml__blood-transfusion-service-center__145836", "openml__breast-cancer__145799", "openml__breast-w__15", "openml__colic__25", "openml__colic__27", "openml__credit-approval__29", "openml__cylinder-bands__14954", "openml__dermatology__35", "openml__diabetes__37", "openml__dresses-sales__125920", "openml__ecoli__145977", "openml__eucalyptus__2079", "openml__fertility__9984", "openml__fri_c0_100_5__3620", "openml__fri_c3_100_5__3779", "openml__glass__40", "openml__hayes-roth__146063", "openml__heart-c__48", "openml__heart-h__50", "openml__hill-valley__145847", "openml__ilpd__9971", "openml__ionosphere__145984", "openml__iris__59", "openml__irish__3543", "openml__kc2__3913", "openml__labor__4", "openml__lung-cancer__146024", "openml__lymph__10", "openml__monks-problems-2__146065", "openml__pc1__3918", "openml__postoperative-patient-data__146210", "openml__profb__3561", "openml__qsar-biodeg__9957", "openml__rabe_266__3647", "openml__socmob__3797", "openml__sonar__39", "openml__synthetic_control__3512", "openml__tae__47", "openml__tic-tac-toe__49", "openml__transplant__3748", "openml__vehicle__53", "openml__visualizing_environmental__3602", "openml__visualizing_livestock__3731", "openml__wdbc__9946", "openml__yeast__145793"]

Mar 29 '24 21:03 crwhite14

I just saw this issue. Are you aware that the datasets for Table 2 have a duplicate? "openml__blood-transfusion-service-center__10101", "openml__blood-transfusion-service-center__145836"?
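A quick way to spot candidates like this is to group the identifiers by dataset name and look for names that occur with more than one task ID. This is a minimal sketch, assuming every identifier follows the `openml__<name>__<task_id>` pattern seen in the lists above; the helper names `split_identifier` and `names_with_multiple_tasks` are made up for illustration and are not part of the repository.

```python
from collections import defaultdict

PREFIX = "openml__"


def split_identifier(identifier):
    """Split 'openml__<name>__<task_id>' into (name, task_id)."""
    if not identifier.startswith(PREFIX):
        raise ValueError(f"unexpected identifier: {identifier!r}")
    # Split from the right so the task ID is always the last component.
    name, task_id = identifier[len(PREFIX):].rsplit("__", 1)
    return name, int(task_id)


def names_with_multiple_tasks(identifiers):
    """Return {name: sorted task IDs} for names that occur with several task IDs."""
    tasks_by_name = defaultdict(set)
    for identifier in identifiers:
        name, task_id = split_identifier(identifier)
        tasks_by_name[name].add(task_id)
    return {
        name: sorted(task_ids)
        for name, task_ids in tasks_by_name.items()
        if len(task_ids) > 1
    }


# A few entries copied from the Table 2 list above.
sample = [
    "openml__blood-transfusion-service-center__10101",
    "openml__blood-transfusion-service-center__145836",
    "openml__colic__25",
    "openml__colic__27",
    "openml__iris__59",
]
repeated = names_with_multiple_tasks(sample)
print(repeated)
```

A shared name is only a hint, not proof of duplication: `colic` also shows up with two task IDs, and whether those two tasks use the same data would have to be checked on OpenML.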

Jun 04 '24 09:06 LennartPurucker

@LennartPurucker thanks for pointing this out (cc @crwhite14). We could remove the duplicate dataset from any results that include it.

It looks like we accidentally pulled two different OpenML tasks (https://openml.org/search?type=task&id=145836 and https://openml.org/search?type=task&id=10101) that appear to be identical because they are based on the same dataset (https://openml.org/search?type=data&id=1464).
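For anyone who wants to drop the duplicate on their side, here is a rough sketch that keeps only the first task seen for each underlying dataset. It assumes you already have a mapping from task ID to dataset ID; the two entries below come from the links above, while the function name `drop_duplicate_tasks` and the mapping variable are hypothetical. Tasks missing from the mapping are treated as unique.

```python
def drop_duplicate_tasks(identifiers, dataset_id_by_task):
    """Keep the first identifier seen for each underlying dataset.

    identifiers: strings of the form 'openml__<name>__<task_id>'.
    dataset_id_by_task: dict mapping task ID -> dataset ID (supplied by the caller).
    """
    seen = set()
    kept = []
    for identifier in identifiers:
        task_id = int(identifier.rsplit("__", 1)[1])
        # Fall back to a per-task key when the dataset ID is unknown,
        # so unmapped tasks are never merged with anything else.
        key = dataset_id_by_task.get(task_id, ("task", task_id))
        if key in seen:
            continue
        seen.add(key)
        kept.append(identifier)
    return kept


# Task -> dataset mapping taken from the OpenML links in this comment.
dataset_id_by_task = {10101: 1464, 145836: 1464}

identifiers = [
    "openml__blood-transfusion-service-center__10101",
    "openml__blood-transfusion-service-center__145836",
    "openml__iris__59",
]
deduplicated = drop_duplicate_tasks(identifiers, dataset_id_by_task)
print(deduplicated)
```

Which of the two tasks to keep is a judgment call; this sketch simply keeps whichever appears first in the list.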

Sep 23 '24 13:09 duncanmcelfresh