[EXE] pt1: Simple Decision Tree exercise, pt2: Pipelines
Learning Goals
Part 1:
- Work with the scikit-learn library: train-test split and reporting different scores.
- Decision Trees.
Part 2:
- Work with Pipelines (with DecisionTrees), imputers, scalers and encoders.
- Grid Search.
Exercise Statement
Part 1: Apply different Decision Trees to train a model for detecting breast cancer using the breast-cancer-wisconsin-diagnostic-dataset (scikit-learn 7.2.7. Breast cancer wisconsin (diagnostic) dataset). The goal is to predict whether the breast cancer is malignant or benign.
Part 2: Apply various transformations (imputers, encoders, scalers) using Pipelines with DecisionTreeClassifiers. Use grid search to find the best parameters. The goal is to predict whether income exceeds $50K/yr based on census data.
Prerequisites
DecisionTreeClassifier, Pipeline, SimpleImputer, StandardScaler, OneHotEncoder, ColumnTransformer, GridSearchCV
Data source/summary:
Part 1: 569 instances with 30 numeric attributes. Class distribution: 212 Malignant, 357 Benign. Follow the link below for the full description of the dataset: https://scikit-learn.org/stable/datasets/#breast-cancer-wisconsin-diagnostic-dataset
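A minimal sketch of what Part 1 could look like, assuming the dataset is loaded through scikit-learn's `load_breast_cancer`. The 75/25 split, the `random_state` values and the particular `criterion`/`max_depth` combinations are illustrative choices, not part of the assignment. Malignant is treated as the positive class (`pos_label=0`), which is also an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# In this dataset the target is encoded as 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Stratified split keeps the malignant/benign ratio similar in both sets.
# test_size and random_state are arbitrary choices for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scores = {}
for criterion in ("gini", "entropy"):
    for max_depth in (3, None):  # a shallow tree and an unrestricted one
        clf = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        scores[(criterion, max_depth)] = {
            "accuracy": accuracy_score(y_test, y_pred),
            # pos_label=0 reports the scores for the malignant class.
            "precision": precision_score(y_test, y_pred, pos_label=0),
            "recall": recall_score(y_test, y_pred, pos_label=0),
            "f1": f1_score(y_test, y_pred, pos_label=0),
        }

for params, metrics in scores.items():
    print(params, {name: round(value, 3) for name, value in metrics.items()})
```

Comparing the rows shows how the split criterion and depth limit affect each score on the held-out set.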
Part 2: income.csv is used as the training set. 32561 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
income_test.csv is used for testing and reporting scores. 15315 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
The goal is to predict whether income exceeds $50K/yr based on census data. Link: http://archive.ics.uci.edu/ml/datasets/Adult
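A rough sketch of how the Part 2 pieces might fit together. Since the exact column headers of income.csv are not given here, it uses a small synthetic DataFrame as a stand-in; the column names, the synthetic target, and the parameter grid are all assumptions for illustration only. With the real data, `df` would come from `pd.read_csv("income.csv")` and the column lists would cover all 6 numeric and 8 categorical attributes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for income.csv (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
workclass = rng.choice(["Private", "Self-emp", "Gov"], n).astype(object)
workclass[::23] = np.nan  # inject missing categorical values
df = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "hours-per-week": rng.integers(10, 60, n).astype(float),
    "workclass": workclass,
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
})
df.loc[::17, "age"] = np.nan  # inject missing numeric values
# Made-up binary target, only so that the sketch has something to learn.
y = (df["hours-per-week"] > 40).astype(int)

numeric = ["age", "hours-per-week"]
categorical = ["workclass", "education"]

# Numeric columns: impute, then scale. Categorical: impute, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step names joined by "__" address parameters nested inside the pipeline.
param_grid = {
    "tree__max_depth": [2, 4, None],
    "tree__min_samples_leaf": [1, 5],
    "preprocess__num__impute__strategy": ["mean", "median"],
}

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print(search.best_params_)
print(round(test_score, 3))
```

Keeping the imputers, scaler and encoder inside the pipeline means they are refit on each cross-validation fold, so the grid search scores are not influenced by statistics computed on held-out rows.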
(Optional) Further Links/Credits to Relevant Resources:
This exercise was assigned in the machine learning course at Aristotle University of Thessaloniki, and the solution was my submission for it.
@iakovidva Great idea! Please feel free to work on it and create a PR when you are done. Do check out other existing projects and the contributing guidelines to figure out the practice and format of things. Please do let us know if you have any questions. Thanks!