
Pytorch porting failed

Open dedoogong opened this issue 4 years ago • 1 comment

Dear @zjhthu @lzx551402, thanks for your great work and code sharing! I'm struggling to reimplement your TF code in PyTorch, but I fail to get the same accuracy/loss. Almost all the code looks the same, and I actually got the same result over 1 iteration with the same input while doing a kind of unit test. My testing sequence is:

[ Testing ASLFeat Forward part ]

  1. By running sess.run(...), dump numpy net_input0,1 / depth0,1 / K0,1 / rel_pose / dense_feat_map / sum_det_score_map from the TF version.
  2. Convert the above numpy arrays to PT tensors (permuting NHWC -> NCHW for net_inputs and depths); with these I got a very similar sum_det_score_map (conv1 + conv3 + conv6 after calling peakiness_score). The conversion is roughly the sketch below.
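
The conversion step is just loading the dumps and permuting the layout; a minimal sketch (the file names are placeholders for whatever was saved from sess.run):

import numpy as np
import torch

# placeholder dumps saved from the TF session above
net_input0 = np.load('net_input0.npy')   # NHWC, e.g. (1, H, W, 3)
depth0 = np.load('depth0.npy')           # NHWC, e.g. (1, H, W, 1)

# TF tensors are NHWC; PyTorch convolutions expect NCHW
net_input0 = torch.from_numpy(net_input0).permute(0, 3, 1, 2).contiguous().cuda()
depth0 = torch.from_numpy(depth0).permute(0, 3, 1, 2).contiguous().cuda()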

[ Testing Loss part ]

  1. Save TF's inputs (pos0, pos1, dense_feat_map0, dense_feat_map1, score_map0, score_map1) for make_detector_loss(), run both the TF and PT versions on them, and get exactly the same loss/accuracy (roughly as in the sketch below).
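
That check is essentially a fixture-based unit test; a minimal sketch, where 'detector_loss_inputs.npz' is a placeholder for the saved TF tensors and make_detector_loss_pt is the assumed name of my PT port:

import numpy as np
import torch

# load the tensors dumped from the TF run and wrap them as torch tensors
data = {k: torch.from_numpy(v) for k, v in np.load('detector_loss_inputs.npz').items()}

loss_pt, acc_pt = make_detector_loss_pt(data['pos0'], data['pos1'],
                                        data['dense_feat_map0'], data['dense_feat_map1'],
                                        data['score_map0'], data['score_map1'])
print(loss_pt.item(), acc_pt.item())  # should reproduce the values printed by the TF run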

But when I run PT training, the loss doesn't drop below 0.6. I checked the "moving_instance_max" values for the conv1, conv3, and conv8 inputs in peakiness_score (logged roughly as in the sketch after the numbers below), and surprisingly the values evolve very differently:

TF version (from the beginning up to step 100000):

  conv1 => starts around 6, keeps growing to almost 36 - 38
  conv3 => starts around 12, keeps growing to almost 100 - 103
  conv8 => starts around 4, keeps growing to around 12 - 14

PT version (from the beginning, peaking around step 1000, then up to step 100000):

  conv1 => starts around 6, peaks at almost 12 - 15, then stops growing and decreases back to 1
  conv3 => starts around 12, peaks at almost 10 - 13, then stops growing and decreases back to 1
  conv8 => starts around 4, peaks at around 4 - 7, then stops growing and decreases back to 0.xx
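
(For reference, I log these values with something like the sketch below; the buffer name moving_instance_max and the per-level indexing come from my port, not the original code.)

# hypothetical logging inside the training loop
if step % 1000 == 0:
    vals = [float(v) for v in model.moving_instance_max]   # one EMA value per detection level
    print(f'step {step}: moving_instance_max per level = {vals}')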

I used the same ExpLR scheduler, as below:

optimizer = optim.SGD(model.parameters(),  
                      momentum=0.9, lr=0.1, weight_decay=0.0001) 
scheduler_expLR = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99999)
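
I step the scheduler once per iteration rather than once per epoch, assuming the TF config decays the learning rate per global step (which is what a gamma this close to 1 suggests); roughly:

# sketch of my training loop; model(batch) returning the loss is a placeholder
for step, batch in enumerate(train_loader):
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    scheduler_expLR.step()   # per-step decay: lr = 0.1 * 0.99999 ** step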

Can you guess why the feature maps grow so differently?

Plus, the TF code wraps the update in tf.control_dependencies([assign_moving_average(moving_instance_max, instance_max, decay)]) and uses "reuse" so that the same variable is shared across batches.

I think this is meant to keep a moving average of the input's max value, which is then used to normalize the growing feature maps. I update the moving average like this:

instance_max = torch.max(inputs)  # tf.reduce_max(inputs) in the TF code
# seed the buffer with the first observed max (it is initialised to ones)
if (self.moving_instance_max[idx] == 1).all():
    self.moving_instance_max[idx] = instance_max
self.moving_instance_max[idx] = (self.moving_instance_max[idx] * decay
                                 + instance_max * (1 - decay))
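
For reference, my understanding of what TF's assign_moving_average amounts to in PyTorch is a non-trainable registered buffer that is EMA-updated only in training mode and then used to normalize the features. A minimal sketch (the class name and the normalization step are my own reading of peakiness_score, not the original code):

import torch
import torch.nn as nn

class MovingInstanceMax(nn.Module):
    """Sketch of a TF assign_moving_average equivalent for feature normalization."""
    def __init__(self, decay=0.99):
        super().__init__()
        self.decay = decay
        # buffer, not a Parameter: saved with the state dict but never optimised
        self.register_buffer('moving_instance_max', torch.ones(1))

    def forward(self, inputs):
        if self.training:
            instance_max = inputs.detach().max()
            with torch.no_grad():
                # var = decay * var + (1 - decay) * value, as in assign_moving_average
                self.moving_instance_max.mul_(self.decay).add_((1.0 - self.decay) * instance_max)
        return inputs / self.moving_instance_max.clamp(min=1e-6)

The points I tried to mirror from the TF pattern are that the update happens in-place on a buffer (so it survives checkpointing but receives no gradient) and only during training, while evaluation just reuses the stored value.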

I think this may be one of the reasons my results differ from yours. Could you explain the moving average part in more detail?

Thank you very much ~!!!

dedoogong avatar Jul 27 '20 03:07 dedoogong

Hi @dedoogong, I also need to convert the code from TF to PT, and I just wanted to know whether you have solved the problem? I really hope to get your reply, thanks in advance!

Jemmagu avatar Mar 10 '21 09:03 Jemmagu