Running train_test_split.py
Can you give an example of what arguments need to be given to train_test_split.py for, let's say, the breast cancer data?
Hello, sorry for the delay.
That script is just a command-line version of scikit-learn's train_test_split.
After cloning the repository, make sure the project root directory is on the PYTHONPATH. For example, if you are using Python 3.6 and a virtualenv called venv, go to the project root directory and run:
pwd > venv/lib/python3.6/site-packages/imputation-dgm.pth
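This works because Python's site machinery reads each line of a .pth file found in site-packages and appends every existing directory it names to sys.path. A minimal, self-contained illustration of the mechanism, using site.addsitedir on a temporary directory in place of a real site-packages (the paths here are made up for the demo):

```python
import os
import site
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Stand-in for the project root you would put on the path.
    pkg_root = os.path.join(d, "project")
    os.makedirs(pkg_root)

    # Stand-in for venv/lib/python3.6/site-packages.
    pth_dir = os.path.join(d, "site")
    os.makedirs(pth_dir)

    # A .pth file contains one directory path per line.
    with open(os.path.join(pth_dir, "imputation-dgm.pth"), "w") as f:
        f.write(pkg_root + "\n")

    # Process the .pth files in pth_dir, as Python does for site-packages.
    site.addsitedir(pth_dir)

    # The directory listed in the .pth file is now on sys.path.
    print(pkg_root in sys.path)
```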
Now download and parse the data:
python imputation_dgm/pre_processing/breast/download_and_transform.py
This will:
- create the directory data/breast
- download wdbc.data inside data/breast
- create features.npy, labels.npy and metadata.json inside data/breast
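The .npy files are ordinary NumPy arrays, so you can sanity-check them with np.load. A sketch with a synthetic matrix (the real file is data/breast/features.npy; WDBC has 569 rows, though the exact column count the script produces may differ):

```python
import os
import tempfile

import numpy as np

# Round-trip a synthetic feature matrix the way the script stores the real one.
features = np.random.rand(569, 30)  # 569 WDBC rows; 30 columns is an assumption

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "features.npy")
    np.save(path, features)        # what download_and_transform.py does, roughly
    loaded = np.load(path)         # what you can do to inspect the result
    print(loaded.shape)            # (569, 30)
```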
Now you can, for example, split the features into 80% train and 20% test:
python imputation_dgm/pre_processing/train_test_split.py \
--features_format=dense \
data/breast/features.npy \
0.8 \
data/breast/features-train.npy \
data/breast/features-test.npy
Or, if you want to include the labels:
python imputation_dgm/pre_processing/train_test_split.py \
--features_format=dense \
data/breast/features.npy \
0.8 \
data/breast/features-train.npy \
data/breast/features-test.npy \
--labels_format=dense \
--labels=data/breast/labels.npy \
--train_labels=data/breast/labels-train.npy \
--test_labels=data/breast/labels-test.npy
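When labels are included, the split has to apply the same row permutation to both arrays so that features and labels stay aligned. Conceptually it looks like this (a NumPy sketch of the idea, not the script's actual code):

```python
import numpy as np

# Synthetic stand-ins for features.npy and labels.npy.
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

rng = np.random.RandomState(0)
idx = rng.permutation(len(features))   # one shared permutation for both arrays
cut = int(0.8 * len(features))         # the 0.8 train fraction from the command

f_train, f_test = features[idx[:cut]], features[idx[cut:]]
y_train, y_test = labels[idx[:cut]], labels[idx[cut:]]

# Row i of f_train still corresponds to y_train[i].
print(f_train.shape, y_train.shape, f_test.shape, y_test.shape)
```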