gqcnn
Issue: Bug/Performance Issue [Custom Images] - training on a dex-net compatible dataset results in the GQ-CNN being unable to predict good grasps (Pred nonzero is always '0')
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Python version: 2.7.12
- Installed using pip or ROS: pip
- Camera: default
Describe what you are trying to do I am trying to train a GQ-CNN from scratch on a custom dataset, and also to fine-tune the pretrained GQCNN_2.0 model on the same custom dataset. (The datasets are created using the dex-net API.)
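In case it helps to reproduce, this is roughly how I launch training from the Python API (a minimal sketch only; it mirrors what I believe tools/train.py does, the factory functions and the trainer signature are taken from the gqcnn version I have installed and may differ elsewhere, and the dataset path and model name are placeholders for my local setup):

from autolab_core import YamlConfig
from gqcnn import get_gqcnn_model, get_gqcnn_trainer

dataset_dir = "/path/to/egad_gqcnn_dataset"   # placeholder: dataset generated with the dex-net API
output_dir = "models"                          # placeholder output directory
train_config = YamlConfig("cfg/train_dex-net_2.0.yaml")

# Build the network from the "gqcnn" section of the training config and train it.
gqcnn = get_gqcnn_model(backend="tf")(train_config["gqcnn"])
trainer = get_gqcnn_trainer(backend="tf")(gqcnn,
                                          dataset_dir,
                                          "image_wise",      # split name
                                          output_dir,
                                          train_config,
                                          name="gqcnn_egad")  # placeholder model name
trainer.train()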
Describe current behavior Training or fine-tuning (even when also optimizing the base CNN layers) on this dataset results in a network that is unable to make any good grasp predictions. In the log output, 'Pred nonzero' is always 0, even after 5 to 10 iterations in the case of fine-tuning. Is this normal behavior?
Describe the expected behavior I expect the network to predict at least a few of the available good grasps. Interestingly, when I keep the layers up to fc3 or fc4 as the base layers and DO NOT optimize them, the network seems to fine-tune properly and predicts some good grasps, but the error rate is still high.
Describe the input images The input dataset is generated from a dex-net compatible HDF5 database using the dex-net API. Source of the database: https://dougsm.github.io/egad/ (please see the section on dex-net compatible data).
Describe the physical camera setup Not applicable; the images are generated using the dex-net API.
Other info / logs A few lines of the training log:
GQCNNTrainerTF INFO Step took 2.304 sec.
GQCNNTrainerTF INFO Max 0.23993634
GQCNNTrainerTF INFO Min 0.14524038
GQCNNTrainerTF INFO Pred nonzero 0
GQCNNTrainerTF INFO True nonzero 15
GQCNNTrainerTF INFO Step 27312 (epoch 1.426), 0.02 s
GQCNNTrainerTF INFO Minibatch loss: 0.478, learning rate: 0.009025
GQCNNTrainerTF INFO Minibatch error: 11.719
GQCNNTrainerTF INFO Step took 2.369 sec.
GQCNNTrainerTF INFO Max 0.23774128
GQCNNTrainerTF INFO Min 0.19348052
GQCNNTrainerTF INFO Pred nonzero 0
GQCNNTrainerTF INFO True nonzero 80
GQCNNTrainerTF INFO Step 27313 (epoch 1.426), 0.02 s
GQCNNTrainerTF INFO Minibatch loss: 1.077, learning rate: 0.009025
GQCNNTrainerTF INFO Minibatch error: 62.5
GQCNNTrainerTF INFO Step took 2.158 sec.
GQCNNTrainerTF INFO Max 0.23704815
GQCNNTrainerTF INFO Min 0.16592737
GQCNNTrainerTF INFO Pred nonzero 0
GQCNNTrainerTF INFO True nonzero 45
A few lines of the fine-tuning log (fc3 set as the base layer, using the old format for layers up to fc3, and also optimizing the base layers):
10-04 11:56:09 GQCNNTrainerTF INFO Step 191576 (epoch 9.999), 0.08 s
10-04 11:56:09 GQCNNTrainerTF INFO Minibatch loss: 0.433, learning rate: 0.004633
10-04 11:56:09 GQCNNTrainerTF INFO Minibatch error: 13.281
10-04 11:56:10 GQCNNTrainerTF INFO Step took 1.242 sec.
10-04 11:56:10 GQCNNTrainerTF INFO Max 0.25177836
10-04 11:56:10 GQCNNTrainerTF INFO Min 0.16215596
10-04 11:56:10 GQCNNTrainerTF INFO Pred nonzero 0
10-04 11:56:10 GQCNNTrainerTF INFO True nonzero 34
10-04 11:56:10 GQCNNTrainerTF INFO Step 191577 (epoch 9.999), 0.07 s
10-04 11:56:10 GQCNNTrainerTF INFO Minibatch loss: 0.577, learning rate: 0.004633
10-04 11:56:10 GQCNNTrainerTF INFO Minibatch error: 26.563
10-04 11:56:11 GQCNNTrainerTF INFO Step took 1.171 sec.
10-04 11:56:11 GQCNNTrainerTF INFO Max 0.25264603
10-04 11:56:11 GQCNNTrainerTF INFO Min 0.18788987
10-04 11:56:11 GQCNNTrainerTF INFO Pred nonzero 0
10-04 11:56:11 GQCNNTrainerTF INFO True nonzero 49
10-04 11:56:11 GQCNNTrainerTF INFO Step 191578 (epoch 10.0), 0.06 s
10-04 11:56:11 GQCNNTrainerTF INFO Minibatch loss: 0.709, learning rate: 0.004633
10-04 11:56:11 GQCNNTrainerTF INFO Minibatch error: 38.281
10-04 11:56:13 GQCNNTrainerTF INFO Step took 1.36 sec.
10-04 11:56:13 GQCNNTrainerTF INFO Max 0.25366336
10-04 11:56:13 GQCNNTrainerTF INFO Min 0.17693533
10-04 11:56:13 GQCNNTrainerTF INFO Pred nonzero 0
10-04 11:56:13 GQCNNTrainerTF INFO True nonzero 16
10-04 11:56:13 GQCNNTrainerTF INFO Step 191579 (epoch 10.0), 0.07 s
10-04 11:56:13 GQCNNTrainerTF INFO Minibatch loss: 0.423, learning rate: 0.004633
10-04 11:56:13 GQCNNTrainerTF INFO Minibatch error: 12.5
10-04 11:56:14 GQCNNTrainerTF INFO Step took 1.24 sec.
10-04 11:56:14 GQCNNTrainerTF INFO Max 0.25436333
10-04 11:56:14 GQCNNTrainerTF INFO Min 0.1827491
10-04 11:56:14 GQCNNTrainerTF INFO Pred nonzero 0
10-04 11:56:14 GQCNNTrainerTF INFO True nonzero 10
10-04 11:56:14 GQCNNTrainerTF INFO Step 191580 (epoch 10.0), 0.07 s
10-04 11:56:14 GQCNNTrainerTF INFO Minibatch loss: 0.372, learning rate: 0.004633
10-04 11:56:14 GQCNNTrainerTF INFO Minibatch error: 7.813
Another interesting thing is that the softmax output does not look right: of the 2 outputs, the 1st value is always around 0.7 and the 2nd value is around 0.3 (this varies somewhat between training runs due to the random initialization of the weights). Sample softmax output:
array([[0.7649399 , 0.23506004],
[0.7651925 , 0.23480749],
[0.76295185, 0.23704815],
[0.7643285 , 0.23567156],
[0.7630225 , 0.23697755],
[0.7642536 , 0.23574635],
[0.76532423, 0.23467574],
[0.76295376, 0.23704618],
[0.7632632 , 0.23673679],
[0.76498514, 0.23501493],
[0.7632064 , 0.2367936 ],
[0.7959242 , 0.20407586],
[0.7641547 , 0.23584531],
[0.76448244, 0.23551749],
[0.76394135, 0.23605862],
[0.7647108 , 0.23528923],
[0.7639811 , 0.23601893],
[0.7649897 , 0.23501036],
[0.7647293 , 0.23527072],
[0.7651613 , 0.23483868],
[0.76307136, 0.23692863],
[0.7640458 , 0.23595421],
[0.76476514, 0.23523483],
[0.7672727 , 0.23272723],
[0.7630191 , 0.2369809 ],
[0.7645683 , 0.23543172],
[0.7641252 , 0.2358748 ],
[0.7639672 , 0.23603278],
[0.7635745 , 0.23642555],
[0.79914796, 0.20085205],
[0.7640747 , 0.23592529],
[0.76295626, 0.23704374],
[0.7648026 , 0.23519741],
[0.76468086, 0.23531915],
[0.79236376, 0.20763627],
[0.763892 , 0.23610799],
[0.76452196, 0.23547806],
[0.76323694, 0.2367631 ],
[0.76363677, 0.23636323],
[0.7694154 , 0.23058464],
.......
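If I understand the trainer's log correctly, 'Pred nonzero' simply counts the minibatch samples whose predicted class (the argmax over the two softmax outputs) is 1, so with the second column stuck around 0.23 the count can never be anything but 0. A minimal numpy sketch of that check on a few of the rows above (the variable names are mine, not from the gqcnn code):

import numpy as np

# A few of the softmax rows printed above: column 0 ~ p(bad grasp), column 1 ~ p(good grasp).
softmax_out = np.array([[0.7649399, 0.23506004],
                        [0.7651925, 0.23480749],
                        [0.76295185, 0.23704815],
                        [0.7959242, 0.20407586]])

pred = np.argmax(softmax_out, axis=1)   # predicted class per sample
pred_nonzero = np.count_nonzero(pred)   # this appears to be what the log reports as 'Pred nonzero'
print(pred)                             # -> [0 0 0 0]
print(pred_nonzero)                     # -> 0, because column 1 never exceeds 0.5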
Hi @visatish, could you please let me know if this is normal behavior? Any clue as to what could be the reason for this?
Yeah, I have the same problem. The dataset I use is the Dex-Net 2.0 dataset and the output is the same: Pred nonzero is always 0. Did you find the reason?
After testing a little more, it seems that when you train longer (around 3 epochs), the predictions are no longer all 0. Also, after removing all the noising params from the .yaml file, they are not all 0 at the beginning either...
Hi @elevenjiang1, thanks for your comments. I observed the same effect of longer training that you mention, and in addition that using a batch size of 64 strangely gives comparatively better nonzero counts than a batch size of 128. I am interested in the noise parameters that you deactivated. Did you remove all of the parameters below, under "# denoising / synthetic data params", from the config file used for training the network?
# denoising / synthetic data params
multiplicative_denoising: 1
gamma_shape: 1000.00
symmetrize: 1
gaussian_process_denoising: 1
gaussian_process_rate: 0.5
gaussian_process_scaling_factor: 4.0
gaussian_process_sigma: 0.005
Sorry for replying so late, @aprath1. Actually, I find that this problem resolves itself when you train for long enough; removing the parameters does not seem to make much difference.
By the way, by 'remove' I mean setting them to zero. Below is my yaml file for training on the Dex-Net 2.0 dataset, which is based on train_dex-net_2.0.yaml:
# general optimization params
train_batch_size: 64
val_batch_size: &val_batch_size 64
# logging params
num_epochs: 40 # number of epochs to train for
eval_frequency: 2 # how often to get validation error (in epochs)
save_frequency: 2 # how often to save output (in epochs)
vis_frequency: 10000 # how often to visualize filters (in epochs)
log_frequency: 300 # how often to log output (in steps)
# train / val split params
train_pct: 0.8 # percentage of the data to use for training vs validation
total_pct: 1.0 # percentage of all the files to use
eval_total_train_error: 0 # whether or not to evaluate the total training error on each validation
max_files_eval: 1000 # the number of validation files to use in each eval
# optimization params
loss: sparse
optimizer: momentum
train_l2_regularizer: 0.0005
base_lr: 0.01
decay_step_multiplier: 0.66 # number of times to go through training datapoints before stepping down decay rate (in epochs)
decay_rate: 0.95
momentum_rate: 0.9
max_training_examples_per_load: 128
drop_rate: 0.0
max_global_grad_norm: 100000000000
# input params
training_mode: classification
image_field_name: depth_ims_tf_table
pose_field_name: hand_poses
# label params
target_metric_name: robust_ferrari_canny # name of the field to use for the labels
metric_thresh: 0.002 # threshold for positive examples (label = 1 if grasp_metric > metric_thresh)
# preproc params
num_random_files: 10000 # the number of random files to compute dataset statistics in preprocessing (lower speeds initialization)
preproc_log_frequency: 100 # how often to log preprocessing (in steps)
# denoising / synthetic data params
multiplicative_denoising: 0
gamma_shape: 1000.00
symmetrize: 0
gaussian_process_denoising: 0
gaussian_process_rate: 0.5
gaussian_process_scaling_factor: 4.0
gaussian_process_sigma: 0.005
# tensorboard
tensorboard_port: 6006
# debugging params
debug: &debug 0
debug_num_files: 10 # speeds up initialization
seed: &seed 24098
### GQCNN CONFIG ###
gqcnn:
  # basic data metrics
  im_height: 32
  im_width: 32
  im_channels: 1
  debug: *debug
  seed: *seed
  # needs to match input data mode that was used for training, determines the pose dimensions for the network
  gripper_mode: legacy_parallel_jaw
  # prediction batch size, in training this will be overriden by the val_batch_size in the optimizer's config file
  batch_size: *val_batch_size
  # architecture
  architecture:
    im_stream:
      conv1_1:
        type: conv
        filt_dim: 7
        num_filt: 64
        pool_size: 1
        pool_stride: 1
        pad: SAME
        norm: 0
        norm_type: local_response
      conv1_2:
        type: conv
        filt_dim: 5
        num_filt: 64
        pool_size: 2
        pool_stride: 2
        pad: SAME
        norm: 1
        norm_type: local_response
      conv2_1:
        type: conv
        filt_dim: 3
        num_filt: 64
        pool_size: 1
        pool_stride: 1
        pad: SAME
        norm: 0
        norm_type: local_response
      conv2_2:
        type: conv
        filt_dim: 3
        num_filt: 64
        pool_size: 2
        pool_stride: 2
        pad: SAME
        norm: 1
        norm_type: local_response
      fc3:
        type: fc
        out_size: 1024
    pose_stream:
      pc1:
        type: pc
        out_size: 16
      pc2:
        type: pc
        out_size: 0
    merge_stream:
      fc4:
        type: fc_merge
        out_size: 1024
      fc5:
        type: fc
        out_size: 2
  # architecture normalization constants
  radius: 2
  alpha: 2.0e-05
  beta: 0.75
  bias: 1.0
  # leaky relu coefficient
  relu_coeff: 0.0
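One more thing worth checking with the config above: as the comments for target_metric_name / metric_thresh say, the continuous robust_ferrari_canny values stored in the dataset are binarized with metric_thresh, so only grasps with a metric above 0.002 become positive examples. A quick numpy illustration of that thresholding (my own sketch, not gqcnn code; the metric values are made up):

import numpy as np

metric_thresh = 0.002  # from the config above
# Hypothetical robust_ferrari_canny values for a handful of grasps.
metrics = np.array([0.0000, 0.0015, 0.0021, 0.0100, 0.0006])

labels = (metrics > metric_thresh).astype(np.uint8)
print(labels)                           # -> [0 0 1 1 0]
print("positives: %d" % labels.sum())   # count of positive examples in this batch

If only a small fraction of the grasps in a custom dataset clears this threshold, the positive class becomes very rare during training, which would also push the softmax output towards always predicting class 0.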