dnn_opt

Include small-size datasets for benchmarking

Open jairodelgado opened this issue 5 years ago • 0 comments

We need to include a set of benchmark datasets to compare the accuracy and efficiency of our neural network implementations. In this case, we are looking for small-size datasets (fewer than 1k examples) that deal with either classification or regression. A good starting point is probably the following publicly available datasets:

  • [ ] Breast cancer
  • [ ] Contact lenses
  • [ ] CPU
  • [ ] Credit
  • [ ] Diabetes
  • [ ] Glass
  • [ ] Ionosphere
  • [ ] Iris
  • [ ] Labor
  • [ ] Soybean
  • [ ] Supermarket
  • [ ] Unbalanced
  • [ ] Vote
  • [ ] Weather

We need to create a notebook transforming the data as follows:

  • Dataset format should be CSV with header information (comma-separated, header contains attribute names).
  • Nominal and ordinal values should be transformed into numerical ones (create dummy variables for simplicity).
  • Normalize numerical values (scale the values to [0, 1]).
  • Ordinal cyclic variables such as day of the week should be transformed using sin and cos transformations.
  • Remove missing data.
  • Identify and remove outliers.
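
The transformation steps above could be sketched in pandas roughly as follows. This is a minimal sketch, not the project's implementation: the `transform` function, its parameters, and the toy column names are hypothetical. The issue does not say how outliers should be detected, so Tukey's 1.5 × IQR rule is used here purely as an assumption.

```python
import numpy as np
import pandas as pd


def transform(df, nominal, numeric, cyclic):
    """Apply the listed transformation steps to a DataFrame.

    nominal: columns to turn into dummy variables.
    numeric: columns to filter for outliers and scale to [0, 1].
    cyclic:  dict mapping column name -> period (e.g. 7 for day of the week).
    """
    # Remove rows with missing data.
    out = df.dropna().copy()

    # Remove outliers (assumption: 1.5 * IQR rule, applied per numeric column).
    keep = pd.Series(True, index=out.index)
    for col in numeric:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[keep].copy()

    # Replace each cyclic variable with a (sin, cos) pair.
    for col, period in cyclic.items():
        angle = 2 * np.pi * out[col] / period
        out[col + "_sin"] = np.sin(angle)
        out[col + "_cos"] = np.cos(angle)
        out = out.drop(columns=col)

    # Min-max scale numeric columns to [0, 1]; constant columns become 0.
    for col in numeric:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0

    # Create dummy variables for nominal columns.
    return pd.get_dummies(out, columns=nominal, dtype=float)


# Toy data (made up for illustration): one missing value, one obvious outlier.
toy = pd.DataFrame({
    "outlook": ["sunny", "rainy", "sunny", "overcast", "rainy", None],
    "temperature": [20.0, 22.0, 21.0, 23.0, 500.0, 19.0],
    "weekday": [0, 1, 2, 3, 4, 5],
})
result = transform(toy, nominal=["outlook"], numeric=["temperature"],
                   cyclic={"weekday": 7})
```

Calling `result.to_csv(path, index=False)` would then write a comma-separated file whose header contains the attribute names. Note that the sin/cos columns lie in [-1, 1]; whether they should also be rescaled to [0, 1] is left open here.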

Every step should be documented and explained in the notebook. Include in the notebook the following information for each dataset:

  • Number of input and output variables.
  • Variable type: nominal, ordinal, numeric, etc.
  • Number of examples.
  • Are the examples balanced or not?
  • Number of examples removed due to missing values.
  • Number of outliers.
  • Best reported accuracy so far, with a description of the model used (include an ISO 690 reference to the paper).
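
Most of the per-dataset facts above can be computed rather than typed by hand. The following is a rough sketch under stated assumptions: the `summarize` helper and the toy columns are hypothetical, a single output column is assumed, and outliers are counted as the rows dropped after missing-value removal. The nominal/ordinal distinction and the best reported accuracy cannot be derived from the data and still have to be filled in manually.

```python
import pandas as pd


def summarize(raw, clean, target):
    """Collect per-dataset facts into a dict.

    raw:    DataFrame as loaded from the original file.
    clean:  DataFrame after rows with missing values and outliers were removed.
    target: name of the output column (assumption: exactly one output).
    """
    n_missing = int(raw.isna().any(axis=1).sum())
    return {
        "n_inputs": raw.shape[1] - 1,
        "n_outputs": 1,
        # pandas dtypes only separate numeric from non-numeric columns.
        "dtypes": raw.dtypes.astype(str).to_dict(),
        "n_examples": len(raw),
        # Class counts show whether the examples are balanced.
        "class_counts": raw[target].value_counts().to_dict(),
        "n_removed_missing": n_missing,
        "n_outliers": len(raw) - n_missing - len(clean),
    }


# Toy data (made up for illustration).
raw = pd.DataFrame({
    "temperature": [20.0, 22.0, None, 23.0, 500.0, 19.0],
    "play": ["yes", "no", "yes", "yes", "no", "yes"],
})
clean = raw.dropna()
clean = clean[clean["temperature"] < 100]  # stand-in for a real outlier filter
info = summarize(raw, clean, target="play")
```

For regression datasets the `class_counts` entry is not meaningful and could be replaced by a histogram of the target.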

The output of this issue should be a new directory structure within the docs folder. Create the following hierarchy of folders:

  • docs/bench_data/small/original/: with all the original .ARFF files.
  • docs/bench_data/small/transformed/: with all the transformed files.
  • docs/bench_data/small/notebooks/: with the transform.ipynb notebook and its transform.html export, which can be used to generate the docs/bench_data/small/transformed/ files from the files in docs/bench_data/small/original/.
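
As an illustration, the folder hierarchy above could be created, and an original file mapped to its transformed counterpart, with a few lines of standard-library Python. The helper names `create_layout` and `transformed_path` and the file name `iris.arff` are hypothetical, and the `.csv` naming of transformed files is an assumption based on the CSV requirement.

```python
import tempfile
from pathlib import Path


def create_layout(root):
    """Create docs/bench_data/small/{original,transformed,notebooks} under root."""
    base = Path(root) / "docs" / "bench_data" / "small"
    dirs = [base / name for name in ("original", "transformed", "notebooks")]
    for d in dirs:
        d.mkdir(parents=True, exist_ok=True)
    return dirs


def transformed_path(original_file):
    """Map original/<name>.arff to the sibling transformed/<name>.csv path."""
    original_file = Path(original_file)
    return original_file.parent.parent / "transformed" / (original_file.stem + ".csv")


# Try the layout in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(tmp)
    names = sorted(p.name for p in created)
    all_exist = all(p.is_dir() for p in created)

target = transformed_path("docs/bench_data/small/original/iris.arff")
```

Because `mkdir` is called with `exist_ok=True`, rerunning the notebook does not fail when the folders already exist.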

jairodelgado, Feb 08 '20 20:02