dnn_opt
Include small-size datasets for benchmarking
We need to include a set of benchmark datasets to compare the accuracy and efficiency of our neural implementations. In this case, we are looking for small datasets (fewer than 1k examples) that deal with either classification or regression. A good starting point is probably the following publicly available datasets:
- [ ] Breast cancer
- [ ] Contact lenses
- [ ] CPU
- [ ] Credit
- [ ] Diabetes
- [ ] Glass
- [ ] Ionosphere
- [ ] Iris
- [ ] Labor
- [ ] Soybean
- [ ] Supermarket
- [ ] Unbalanced
- [ ] Vote
- [ ] Weather
We need to create a notebook transforming the data as follows:
- Dataset format should be CSV with header information (comma-separated, header contains attribute names).
- Nominal and ordinal values should be transformed into numerical ones (create dummies for simplicity).
- Normalize numerical values (scale the values to [0, 1]).
- Ordinal cyclic variables such as day of the week should be transformed using sin and cos transformations.
- Remove missing data.
- Identify and remove outliers.
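The steps above could be sketched in the notebook roughly as follows. This is only a sketch: the column names are placeholders, and the 1.5×IQR rule is one possible choice for outlier detection, not a requirement of this issue.

```python
import numpy as np
import pandas as pd

def preprocess(df, nominal_cols, numeric_cols, cyclic_cols=None):
    """Apply the transformation steps to a raw dataset.

    cyclic_cols maps a column name to its period, e.g. {"day": 7}.
    """
    # Remove rows with missing data first, so later statistics are clean.
    df = df.dropna()

    # One-hot encode nominal/ordinal columns ("dummies").
    df = pd.get_dummies(df, columns=nominal_cols)

    # Encode cyclic ordinal variables (e.g. day of week) with sin/cos.
    for col, period in (cyclic_cols or {}).items():
        df[col + "_sin"] = np.sin(2 * np.pi * df[col] / period)
        df[col + "_cos"] = np.cos(2 * np.pi * df[col] / period)
        df = df.drop(columns=[col])

    # Remove outliers with the 1.5*IQR rule on numeric columns (one
    # possible criterion; document whichever rule is actually used).
    for col in numeric_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]

    # Scale numeric columns to [0, 1] (min-max normalization).
    for col in numeric_cols:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0

    return df
```

Note that outliers are removed before normalization, so extreme values do not compress the [0, 1] range of the remaining examples.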
Every step should be documented and explained in the notebook. Include in the notebook the following information for each dataset:
- Number of input and output variables.
- Variable type: nominal, ordinal, numeric, etc.
- Number of examples.
- Are the examples balanced or not?
- Number of examples removed due to missing values.
- Number of outliers.
- Best reported accuracy so far, describe the model used (include an ISO 690 reference to the paper).
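Most of the per-dataset facts above can be computed automatically in the notebook; a minimal sketch (the `target` column name and the balance heuristic are assumptions, and the best reported accuracy still has to be filled in by hand from the literature):

```python
import pandas as pd

def summarize(df, target="class"):
    """Collect basic per-dataset statistics for the notebook report."""
    counts = df[target].value_counts()
    return {
        "n_inputs": len(df.columns) - 1,
        "n_outputs": 1,
        "variable_types": {c: str(t) for c, t in df.dtypes.items()},
        "n_examples": len(df),
        # Simple heuristic: call the dataset balanced if the largest
        # class is at most twice the size of the smallest one.
        "balanced": bool(counts.max() <= 2 * counts.min()),
        "n_missing_rows": int(df.isna().any(axis=1).sum()),
    }
```

The number of removed outliers can be reported as the difference in row count before and after the outlier-removal step.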
The output of this issue should be a new directory structure within the docs folder. Create the following hierarchy of folders:
- `docs/bench_data/small/original/`: with all the original .ARFF files.
- `docs/bench_data/small/transformed/`: with all the transformed files.
- `docs/bench_data/small/notebooks/`: with the `transform.ipynb` and `transform.html` notebook that can be used to generate the `docs/bench_data/small/transformed/` files from the files in `docs/bench_data/small/original/`.
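The folder hierarchy could be created from the notebook itself, e.g.:

```python
from pathlib import Path

# Create the docs/bench_data/small hierarchy used by this issue.
base = Path("docs/bench_data/small")
for sub in ("original", "transformed", "notebooks"):
    (base / sub).mkdir(parents=True, exist_ok=True)
```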