Running train_test_split.py
Can you give an example of what arguments need to be given to train_test_split.py for, let's say, the breast cancer data?
Hello, sorry for the delay.
That script is just a command-line version of scikit-learn's train_test_split.
After cloning the repository, make sure the project root directory is on the PYTHONPATH. For example, if you are using Python 3.6 and a virtualenv called venv, go to the project root directory and run:
pwd > venv/lib/python3.6/site-packages/imputation-dgm.pth
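This works because Python's site machinery reads each line of a .pth file found in site-packages and appends every existing directory it names to sys.path. A minimal, self-contained illustration of the mechanism, using site.addsitedir on a temporary directory in place of a real site-packages (the paths here are made up for the demo):

```python
import os
import site
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Stand-in for the project root you would put on the path.
    pkg_root = os.path.join(d, "project")
    os.makedirs(pkg_root)

    # Stand-in for venv/lib/python3.6/site-packages.
    pth_dir = os.path.join(d, "site")
    os.makedirs(pth_dir)

    # A .pth file contains one directory path per line.
    with open(os.path.join(pth_dir, "imputation-dgm.pth"), "w") as f:
        f.write(pkg_root + "\n")

    # Process the .pth files in pth_dir, as Python does for site-packages.
    site.addsitedir(pth_dir)

    # The directory listed in the .pth file is now on sys.path.
    print(pkg_root in sys.path)
```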
Now download and parse the data:
python imputation_dgm/pre_processing/breast/download_and_transform.py
This will:
- create the directory data/breast
- download wdbc.data inside data/breast
- create features.npy, labels.npy and metadata.json inside data/breast
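The .npy files are ordinary NumPy arrays, so you can sanity-check them with np.load. A sketch with a synthetic matrix (the real file is data/breast/features.npy; WDBC has 569 rows, though the exact column count the script produces may differ):

```python
import os
import tempfile

import numpy as np

# Round-trip a synthetic feature matrix the way the script stores the real one.
features = np.random.rand(569, 30)  # 569 WDBC rows; 30 columns is an assumption

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "features.npy")
    np.save(path, features)        # what download_and_transform.py does, roughly
    loaded = np.load(path)         # what you can do to inspect the result
    print(loaded.shape)            # (569, 30)
```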
Now you can, for example, split the features into 80% train and 20% test:
python imputation_dgm/pre_processing/train_test_split.py \
--features_format=dense \
data/breast/features.npy \
0.8 \
data/breast/features-train.npy \
data/breast/features-test.npy
Or, if you want to include the labels:
python imputation_dgm/pre_processing/train_test_split.py \
--features_format=dense \
data/breast/features.npy \
0.8 \
data/breast/features-train.npy \
data/breast/features-test.npy \
--labels_format=dense \
--labels=data/breast/labels.npy \
--train_labels=data/breast/labels-train.npy \
--test_labels=data/breast/labels-test.npy
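When labels are included, the split has to apply the same row permutation to both arrays so that features and labels stay aligned. Conceptually it looks like this (a NumPy sketch of the idea, not the script's actual code):

```python
import numpy as np

# Synthetic stand-ins for features.npy and labels.npy.
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

rng = np.random.RandomState(0)
idx = rng.permutation(len(features))   # one shared permutation for both arrays
cut = int(0.8 * len(features))         # the 0.8 train fraction from the command

f_train, f_test = features[idx[:cut]], features[idx[cut:]]
y_train, y_test = labels[idx[:cut]], labels[idx[cut:]]

# Row i of f_train still corresponds to y_train[i].
print(f_train.shape, y_train.shape, f_test.shape, y_test.shape)
```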