MSCG-Net
Automatic train-val split
Is there an option to play around with the sizes of the train and val sets? For example, could x% of the train set be used as the val set, instead of predefining both train and val sets manually at the beginning?
You can change kf and k_folder in train_args for cross-validation training. For example, kf=0, k_folder=5 means 5-fold cross-validation, with the current training run using fold 0 as the validation fold. Note that kf < k_folder; the default kf=k_folder=0 means no CV training. To train with full cross-validation, you need to manually set kf to 0, 1, ..., up to k_folder-1 (e.g., 4) and run training once per fold:
```python
train_args = agriculture_configs(net_name='MSCG-Rx50',
                                 data='Agriculture',
                                 bands_list=['NIR', 'RGB'],
                                 kf=0, k_folder=5,  # change kf to 0,1,2,3,4 for CV training
                                 note='reproduce_ACW_loss2_adax')
```
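If you want to run all folds without editing the config by hand each time, a small driver loop works. This is only a sketch; train_main is a hypothetical stand-in for your actual training entry point:

```python
# hypothetical driver: run one training per fold by overriding kf
for fold in range(5):  # k_folder = 5
    train_args = agriculture_configs(net_name='MSCG-Rx50',
                                     data='Agriculture',
                                     bands_list=['NIR', 'RGB'],
                                     kf=fold, k_folder=5,
                                     note='reproduce_ACW_loss2_adax_fold{}'.format(fold))
    train_main(train_args)  # stand-in for your actual training entry point
```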
kf=0 and k_folder=5 did not automatically create a 20% val set from the train set, and threw the following error:
```
----------creating groundtruth data for training./.val---------------
Traceback (most recent call last):
  File "/scratch/manu/MSCG-Net-master_selftrained/./tools/train_ethz.py", line 29, in
```
Does that mean a dummy val set still needs to be provided?
reminder :)
This split is specifically designed for the Agriculture-Vision dataset: it only splits the official val-set into k folds, and it never splits the training set. If you want to split only the train set into train and val subsets, you need to modify the function split_train_val_test_sets. Its current form (with a commented hint on where to change it) looks like this:
```python
def split_train_val_test_sets(data_folder=Data_Folder, name='Agriculture', bands=['NIR', 'RGB'], KF=3, k=1, seeds=69278):
    train_id, t_list = get_training_list(root_folder=TRAIN_ROOT, count_label=False)
    # VAL_ROOT = TRAIN_ROOT  # hint: point VAL_ROOT at the train folder if you have no val folder
    val_id, v_list = get_training_list(root_folder=VAL_ROOT, count_label=False)

    if KF >= 2:
        kf = KFold(n_splits=KF, shuffle=True, random_state=seeds)
        val_ids = np.array(v_list)
        idx = list(kf.split(val_ids))
        if k >= KF:  # k must stay within [0, KF); otherwise fall back to fold 0
            k = 0
        # t2_list: the val-set part merged into training; v_list: the held-out fold
        t2_list, v_list = list(val_ids[idx[k][0]]), list(val_ids[idx[k][1]])
    else:
        print("KF (the number of folds) must be >= 2")
        return -1

    img_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])
    val_folders = [os.path.join(data_folder[name]['ROOT'], 'val', data_folder[name][band]) for band in bands]
    val_gt_folder = os.path.join(data_folder[name]['ROOT'], 'val', data_folder[name]['GT'])

    # training set = the full official train set plus the (KF-1)/KF part of the val set
    train_dict = {
        IDS: train_id,
        IMG: [[img_folder.format(id) for img_folder in img_folders] for id in t_list] +
             [[val_folder.format(id) for val_folder in val_folders] for id in t2_list],
        GT: [gt_folder.format(id) for id in t_list] + [val_gt_folder.format(id) for id in t2_list],
        'all_files': t_list + t2_list
    }
    val_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
        'all_files': v_list
    }
    # test_dict is just the held-out val fold again, not the real test set
    test_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
    }
    print('train set -------', len(train_dict[GT]))
    print('val set ---------', len(val_dict[GT]))
    return train_dict, val_dict, test_dict
```
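As a standalone sanity check of the split mechanics (plain sklearn, nothing repo-specific): KFold with a fixed random_state always reproduces the same folds, and n_splits=5 holds out 20% per fold:

```python
from sklearn.model_selection import KFold
import numpy as np

files = np.array(['img_{:03d}'.format(i) for i in range(100)])
kf = KFold(n_splits=5, shuffle=True, random_state=69278)
idx = list(kf.split(files))
train_part, val_part = files[idx[0][0]], files[idx[0][1]]
print(len(train_part), len(val_part))  # 80 20 -> an automatic 80/20 split
```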
Is there a way to deactivate the val set and use only train and test sets?
I'm not sure what your point is. If you intend to use the test set as the val set, simply change the val folder to the test folder.
If I use the test set as the val set during iterative training, the model will overfit to that set. What I want is a model trained agnostic of the test set, using either no val set at all during training, or a val set that is a subset of the train set itself (created randomly and automatically, not manually before training starts). I hope I didn't confuse you.
Now I see. Let's say you have a training set with 100 images and a test set with 50 images. You can split the training set into, e.g., train/val 80/20 randomly with 5 folds, train your model on these 5 folds to get 5 best checkpoint weights, and then test the 5 trained checkpoints on the test set (50 images) separately, or ensemble some or all of them as you like. This is the most common way.
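For the ensembling step, here is a minimal test-time averaging sketch over the five fold checkpoints; build_model, the checkpoint file names, and images are hypothetical placeholders, not repo code:

```python
import torch

ckpt_paths = ['best_fold{}.pth'.format(k) for k in range(5)]  # hypothetical names
model = build_model()  # placeholder for constructing MSCG-Rx50
probs = None
for path in ckpt_paths:
    model.load_state_dict(torch.load(path, map_location='cpu'))
    model.eval()
    with torch.no_grad():
        p = torch.softmax(model(images), dim=1)  # images: a test batch (placeholder)
    probs = p if probs is None else probs + p   # accumulate per-class probabilities
pred = (probs / len(ckpt_paths)).argmax(dim=1)  # averaged-ensemble prediction
```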
Another way, as you said, is 'no val set': you train your model on all 100 images without validation. However, if you don't validate the model during training, you still need to save the best checkpoint every certain number of epochs (e.g., every 200 epochs), based either on the best loss or the best metric (e.g., F1), evaluated either on all 100 images (using the whole train set itself as the val set) or on a randomly selected part of the 100 images. If so, you need to modify your training pipeline accordingly. I think it's possible and not hard to implement.
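A minimal sketch of that no-val-set variant, assuming a hypothetical train_one_epoch helper that returns the mean training loss (model, train_loader, optimizer, max_epochs, and save_every are placeholders):

```python
import torch

best_loss = float('inf')
for epoch in range(max_epochs):
    epoch_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    # no val set: periodically keep the checkpoint with the lowest training loss
    if (epoch + 1) % save_every == 0 and epoch_loss < best_loss:
        best_loss = epoch_loss
        torch.save(model.state_dict(), 'best_ckpt.pth')
```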
> e.g., train/val 80/20 randomly with 5 folds, then you can train your model on these 5 folds to get 5 best checkpoint weights, then you can test your trained 5 checkpoints on the test set (50 images) separately or ensemble some or all of them as you like. This is the most common way.
Yes, I want this option, but the 80/20 split must be done automatically, not manually. Is that possible with the current code?
Yes, it's possible; you just need to slightly change split_train_val_test_sets as follows (there might be some bugs, so feel free to modify it further):
```python
# change DATASET_ROOT to your dataset path
DATASET_ROOT = '/media/liu/diskb/data/Agriculture-Vision'
TRAIN_ROOT = os.path.join(DATASET_ROOT, 'train')

def split_train_val_test_sets(data_folder=Data_Folder, name='Agriculture', bands=['NIR', 'RGB'], KF=5, k=0, seeds=69278):
    train_id, t_list = get_training_list(root_folder=TRAIN_ROOT, count_label=False)
    VAL_ROOT = TRAIN_ROOT  # validation images are drawn from the train folder
    val_id = train_id      # same folder, so the id listing is shared

    if KF >= 2:  # KF must be larger than 1
        kf = KFold(n_splits=KF, shuffle=True, random_state=seeds)
        val_ids = np.array(t_list)
        idx = list(kf.split(val_ids))
        if k >= KF:  # k should not be out of KF range; otherwise fall back to fold 0
            k = 0
        # tr_list: (KF-1)/KF of the train files; v_list: the held-out 1/KF fold
        tr_list, v_list = list(val_ids[idx[k][0]]), list(val_ids[idx[k][1]])
    else:
        print("KF (the number of folds) must be >= 2")
        return -1

    # both "train" and "val" paths now point into the train folder
    img_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])
    val_folders = img_folders
    val_gt_folder = gt_folder

    train_dict = {
        IDS: train_id,
        IMG: [[img_folder.format(id) for img_folder in img_folders] for id in tr_list],
        GT: [gt_folder.format(id) for id in tr_list],
        'all_files': tr_list
    }
    val_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
        'all_files': v_list
    }
    # here test_dict = val_dict; it is not the real test set
    test_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
    }
    print('train set -------', len(train_dict[GT]))
    print('val set ---------', len(val_dict[GT]))
    return train_dict, val_dict, test_dict
```
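With that modification, an automatic 80/20 split of the train folder can then be requested directly, for example (assuming the repo's Data_Folder config is in scope):

```python
# KF=5 holds out 1/5 of the train files per fold -> an automatic 80/20 split
train_dict, val_dict, test_dict = split_train_val_test_sets(
    name='Agriculture', bands=['NIR', 'RGB'], KF=5, k=0)
```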
I will try the above. But I did not understand: why did you declare the val and test dicts identically?
test_dict is not used at all during training; you can safely delete it if you want and just return train_dict and val_dict. I left test_dict here for future modification and debugging on the real test set.
The code was not written well and contains some confusing names, redundant pieces, and bugs; it was never refactored after the Agriculture-Vision workshop was completed. You need to pick out the useful parts and rewrite them as you want.