dnn_opt Include small-size datasets for benchmarking

Open jairodelgado opened this issue 5 years ago • 0 comments

We need to include a set of benchmark datasets to compare the accuracy and efficiency of our neural implementations. In this case, we are looking for small-size datasets (fewer than 1k examples) that deal with either classification or regression. A good starting point is probably the following publicly available datasets:

[ ] Breast cancer
[ ] Contact lenses
[ ] CPU
[ ] Credit
[ ] Diabetes
[ ] Glass
[ ] Ionosphere
[ ] Iris
[ ] Labor
[ ] Soybean
[ ] Supermarket
[ ] Unbalanced
[ ] Vote
[ ] Weather

We need to create a notebook transforming the data as follows:

Dataset format should be CSV with header information (comma-separated, header contains attribute names).
Nominal and ordinal values should be transformed into numerical ones (create dummies for simplicity).
Normalize numerical values (scale the values to [0, 1]).
Ordinal cyclic variables such as day of the week should be transformed using sin and cos transformations.
Remove missing data.
Identify and remove outliers.
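The steps above could be sketched roughly as follows. This is only an illustration, not the notebook itself: it uses the Python standard library on rows stored as dictionaries, and every function name here is hypothetical. The issue does not say how outliers should be identified, so the 1.5 × IQR rule below is an assumption, as is treating "?" (the ARFF missing-value marker) and empty strings as missing.

```python
import math
import statistics

MISSING = {"", "?"}  # assumption: "?" (ARFF marker) and empty strings mean missing


def drop_missing(rows):
    """Keep rows without missing values; also return how many were dropped."""
    kept = [r for r in rows
            if not any(v is None or str(v).strip() in MISSING for v in r.values())]
    return kept, len(rows) - len(kept)


def remove_outliers_iqr(rows, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    values = [float(r[column]) for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    kept = [r for r in rows if low <= float(r[column]) <= high]
    return kept, len(rows) - len(kept)


def one_hot(rows, column):
    """Replace a nominal column with one 0/1 dummy column per category."""
    categories = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new = {key: v for key, v in r.items() if key != column}
        for c in categories:
            new[f"{column}_{c}"] = 1 if r[column] == c else 0
        out.append(new)
    return out


def min_max(rows, column):
    """Scale a numeric column to [0, 1]; a constant column maps to 0.0."""
    values = [float(r[column]) for r in rows]
    low, span = min(values), max(values) - min(values)
    return [{**r, column: (float(r[column]) - low) / span if span else 0.0}
            for r in rows]


def cyclic(rows, column, period):
    """Replace a cyclic column (values 0..period-1) with sin and cos columns."""
    out = []
    for r in rows:
        angle = 2 * math.pi * float(r[column]) / period
        new = {key: v for key, v in r.items() if key != column}
        new[f"{column}_sin"] = math.sin(angle)
        new[f"{column}_cos"] = math.cos(angle)
        out.append(new)
    return out


# Toy data, made up for illustration only.
rows = [
    {"outlook": "sunny", "temp": "20", "day": "0"},
    {"outlook": "rainy", "temp": "30", "day": "3"},
    {"outlook": "?", "temp": "25", "day": "1"},
    {"outlook": "sunny", "temp": "25", "day": "6"},
]
rows, n_missing = drop_missing(rows)
rows, n_outliers = remove_outliers_iqr(rows, "temp")
rows = cyclic(min_max(one_hot(rows, "outlook"), "temp"), "day", period=7)
print(n_missing, n_outliers, rows[0])
```

Missing values and outliers are removed before scaling so that they do not distort the minimum and maximum used for the [0, 1] range. In the actual notebook, pandas equivalents such as `pandas.get_dummies` would probably be more convenient.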

Every step should be documented and explained in the notebook. Include in the notebook the following information for each dataset:

Number of input and output variables.
Variable type: nominal, ordinal, numeric, etc.
Number of examples.
Are the examples balanced or not?
Number of examples removed due to missing values.
Number of outliers.
Best reported accuracy so far, with a description of the model used (include an ISO 690 reference to the paper).
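Most of the per-dataset figures above can be computed rather than written by hand. The following is a minimal sketch under some assumptions: rows are dictionaries, there is a single output column named by the hypothetical `target` argument, and balance is reported as the ratio between the largest and smallest class, which is just one possible measure. The best reported accuracy and its reference still have to be filled in manually.

```python
from collections import Counter


def summarize(rows, target, removed_missing=0, outliers=0):
    """Collect the per-dataset figures requested for the notebook.

    `removed_missing` and `outliers` are counts produced by earlier
    cleaning steps; they are passed in rather than recomputed here.
    """
    counts = Counter(r[target] for r in rows)
    return {
        "n_inputs": len(rows[0]) - 1 if rows else 0,
        "n_outputs": 1,  # assumption: a single target column
        "n_examples": len(rows),
        "class_counts": dict(counts),
        # 1.0 means perfectly balanced; larger values mean more imbalance.
        "imbalance_ratio": (max(counts.values()) / min(counts.values())
                            if counts else None),
        "removed_missing": removed_missing,
        "outliers": outliers,
    }


# Toy data, made up for illustration only.
example = [
    {"sepal": 0.1, "petal": 0.2, "label": "a"},
    {"sepal": 0.4, "petal": 0.9, "label": "b"},
    {"sepal": 0.3, "petal": 0.5, "label": "a"},
]
print(summarize(example, target="label", removed_missing=2, outliers=1))
```

For regression datasets the class counts are not meaningful, so the balance entries would be skipped there.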

The output of this issue should be a new directory structure within the docs folder. Create the following hierarchy of folders:

docs/bench_data/small/original/: with all the original .ARFF files.
docs/bench_data/small/transformed/: with all the transformed files.
docs/bench_data/small/notebooks/: with the transform.ipynb and transform.html notebooks, which can be used to generate the docs/bench_data/small/transformed/ files from the files in docs/bench_data/small/original/.
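As a rough sketch, the hierarchy above could be created from the notebook with `pathlib`, and each transformed dataset written as a CSV file with a header row. The function names are hypothetical, and naming each output after its original file with a `.csv` suffix is an assumption, since the issue does not fix the file names.

```python
import csv
from pathlib import Path

SUBDIRS = ("original", "transformed", "notebooks")


def make_layout(repo_root):
    """Create docs/bench_data/small/{original,transformed,notebooks}."""
    base = Path(repo_root) / "docs" / "bench_data" / "small"
    paths = {name: base / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return paths


def transformed_path(paths, original_file):
    """Map e.g. original/iris.arff to transformed/iris.csv (assumed naming)."""
    return paths["transformed"] / Path(original_file).with_suffix(".csv").name


def write_csv(rows, path):
    """Write rows (dicts) as comma-separated values with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `make_layout(".")` from the repository root creates the three folders. After that, `write_csv(rows, transformed_path(paths, "iris.arff"))` would store one transformed dataset.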

Feb 08 '20 20:02 jairodelgado
