MultiObjectiveOptimization
Cityscapes experiment
Hi,
Thanks for open sourcing the code, this is great!
Could you share your json parameter file for cityscapes?
Also, I think it is missing the file depth_mean.npy, which is needed to run it.
Thanks.
We will update the code in a few days with depth_mean.npy and config files. But in the meantime, here are the config files if you do not want to wait:
- Parameter Set w/ Approximation: `optimizer=Adam|batch_size=8|lr=0.0005|dataset=cityscapes|normalization_type=none|algorithm=mgda|use_approximation=True`
- Parameter Set w/o Approximation: `optimizer=Adam|batch_size=8|lr=0.0001|dataset=cityscapes|normalization_type=none|algorithm=mgda|use_approximation=False`
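For anyone unsure how the pipe-separated strings above map onto the json parameter file asked about, here is a minimal sketch of the w/ Approximation setting as a Python dict; the key names simply mirror the parameter names and are assumptions, so check the repo's sample config for the actual JSON schema.

```python
# Hypothetical translation of the pipe-separated parameter string into config keys.
# Key names are assumptions that mirror the string; the repo's actual JSON schema may differ.
params_with_approx = {
    "optimizer": "Adam",
    "batch_size": 8,
    "lr": 0.0005,
    "dataset": "cityscapes",
    "normalization_type": "none",
    "algorithm": "mgda",
    "use_approximation": True,
}
```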
`depth_mean.npy` is the average depth map of the training set. We use it to make the input zero mean.
Thanks. Will the updated code run with pytorch 1.0? I am running into a few problems since some features are deprecated (e.g. volatile variables, `.data[0]`, etc.).
I am also having an error with the FW step:
sol, min_norm = MinNormSolver.find_min_norm_element([grads[t] for t in tasks])
----> 1 sol, min_norm = MinNormSolver.find_min_norm_element([grads[t] for t in tasks])
~/MultiObjectiveOptimization/min_norm_solvers.py in find_min_norm_element(vecs)
99 # Solution lying at the combination of two points
100 dps = {}
--> 101 init_sol, dps = MinNormSolver._min_norm_2d(vecs, dps)
102
103 n=len(vecs)
~/MultiObjectiveOptimization/min_norm_solvers.py in _min_norm_2d(vecs, dps)
42 dps[(i, j)] = 0.0
43 for k in range(len(vecs[i])):
---> 44 dps[(i,j)] += torch.dot(vecs[i][k], vecs[j][k]).data[0]
45 dps[(j, i)] = dps[(i, j)]
46 if (i,i) not in dps:
RuntimeError: dot: Expected 1-D argument self, but got 4-D
What exactly should `grads` and `[grads[t] for t in tasks]` contain?
Edit: the solution is to replace it with `torch.dot(vecs[i][k].view(-1), vecs[j][k].view(-1)).item()`
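In context, the patched accumulation from the traceback above would look roughly like the following; only the dot-product line changes, the surrounding lines follow the traceback, and the helper wrapper is purely for illustration.

```python
import torch

def accumulate_pairwise_dot(vecs, i, j, dps):
    # Mirrors the loop around min_norm_solvers.py line 44 from the traceback,
    # adapted for PyTorch >= 1.0: flatten the (possibly 4-D) gradient tensors
    # before torch.dot and use .item() instead of the removed .data[0] indexing.
    if (i, j) not in dps:
        dps[(i, j)] = 0.0
        for k in range(len(vecs[i])):
            dps[(i, j)] += torch.dot(vecs[i][k].view(-1), vecs[j][k].view(-1)).item()
        dps[(j, i)] = dps[(i, j)]
    return dps[(i, j)]
```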
I can confirm that those changes worked for me to get the code running with pytorch 1.0. I was able to reproduce the results for the single-task models, but so far no luck with the mgda method. Did you have to include any other changes, @maria8899?
@r0456230 Can you tell me what exactly you are trying to reproduce? The config files I put as a comment should give exact results of mgda w/ and w/o approximation.
Please note that we report the disparity metric in the paper and compute the depth metric in the code; the depth map is later converted into disparity separately as post-processing. If the issue is depth, this should explain it.
mIoU should be exactly the same as what is reported in the code and the paper. We used the parameters I posted as a comment.
@maria8899 Although we are planning to support pytorch 1.0, I am not sure when it will be. I will also update the ReadMe with the exact versions of each Python module we used. PyTorch was 0.3.1.
@ozansener I was able to reproduce the results from the paper for the single-task models using your code (depth, instance segmentation, and semantic segmentation on cityscapes). However, when I run the code with the parameters posted above, after 50 epochs the models seem to be far from the results obtained in the paper.
I think I have managed to make it work with pytorch 1.0, but I still need to check the results and train it fully. @r0456230 I haven't made many other changes; the main problem was in this FW step. Have you set up the scales/tasks correctly in the json file?
@maria8899 @r0456230 could you please tell me how to solve the missing depth_mean.npy problem?
I tried the code below, but I'm not sure if it's correct:
depth_mean = np.mean(depth[depth != 0])
depth[depth != 0] = (depth[depth != 0] - depth_mean) / self.DEPTH_STD
# depth[depth != 0] = (depth[depth != 0] - self.DEPTH_MEAN[depth != 0]) / self.DEPTH_STD
@YoungGer you need to compute a mean image (per pixel) using all training images (or just a few to get an approximation) of the Cityscapes's disparity dataset.
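A rough sketch of computing that per-pixel mean is below; the glob pattern assumes the standard Cityscapes disparity layout, and if your data loader resizes the inputs (e.g. to 256x512) you should compute the mean at that resolution instead.

```python
# Rough sketch: per-pixel mean over the Cityscapes training disparity maps.
# The directory layout below is an assumption about a standard Cityscapes download.
import glob
import numpy as np
from PIL import Image

files = glob.glob("cityscapes/disparity/train/*/*_disparity.png")
acc = None
for f in files:
    d = np.asarray(Image.open(f), dtype=np.float64)
    acc = d.copy() if acc is None else acc + d

depth_mean = acc / len(files)
np.save("depth_mean.npy", depth_mean)
```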
I know, thank you for your help.
@YoungGer Have you noticed that the find_min_norm_element method actually uses the projected gradient descent method? Only find_min_norm_element_FW is the Frank-Wolfe algorithm as discussed in the paper. They are only guaranteed to be equivalent when the number of tasks is 2.
EDIT: Obviously I realised right after sending that question 2 is because of the optimization. Question 1 still remains.
Hi @ozansener, thanks for publishing your code!
I have two questions after reading this answer by @maria8899:
> Edit: the solution is to replace it with `torch.dot(vecs[i][k].view(-1), vecs[j][k].view(-1)).item()`
In your code the z variable returned by the 'backbone' of the network is passed to each task. Its gradient is then used in the find_min_norm algorithm.
First of all, as maria noted, the gradient is a 4D variable that needs to be reshaped to 1D first in torch 1.0.1. I compared the behavior to torch 0.3.1 and it does lead to the same result, but it raised some questions, which may well come from my incomplete understanding of your paper.
- The gradient still has the batch dimension; why do you calculate the min_norm_point between all samples as one big vector instead of, for example, averaging or summing over the batch dimension? Isn't this effectively what happens after the reshaping? This is just intuitively speaking, comparing it with stochastic gradient descent.
- Why is there a batch dimension anyway? From the paper it is not quite clear to me what should be fed into the FrankWolfeSolver, but shouldn't it be the gradient of some parameters instead of an output variable? Or does that not matter and lead to the same result?
Thanks a lot!
@kilsenp First let me answer question 2.
- 2: You are right that if you apply MGDA directly, it should be gradients with respect to the parameters. However, one of the main contributions of the paper is showing that you can instead feed gradients with respect to the representations. This is basically Section 3.3 of the paper, and what we compute in the code is $\nabla_Z$.
- 1: No, you need the batch dimension since the forward pass of the network is different for each image. You can read Section 3.3 of the paper in detail to understand what's going on.
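To make this concrete, here is a minimal sketch of feeding per-task gradients with respect to the shared representation $Z$ into the min-norm solver. The names `encoder`, `heads`, `criteria`, `tasks`, `x`, and `targets` are placeholders, not the repo's actual training-loop variables.

```python
import torch
from min_norm_solvers import MinNormSolver  # solver from this repo

# Placeholders: `encoder` is the shared backbone, `heads[t]` the task-specific
# decoders, `criteria[t]` the per-task losses; x and targets come from the loader.
z = encoder(x)  # shared representation Z (the batch dimension is kept)

# 1) Per-task gradients w.r.t. Z, computed on a detached copy so only the task
#    head is traversed (the Section 3.3 shortcut instead of parameter gradients).
grads = {}
for t in tasks:
    z_t = z.detach().clone().requires_grad_(True)
    criteria[t](heads[t](z_t), targets[t]).backward()
    grads[t] = [z_t.grad.clone()]

# 2) The min-norm element over the gradients w.r.t. Z gives the task scales.
sol, _ = MinNormSolver.find_min_norm_element([grads[t] for t in tasks])
scales = {t: float(sol[i]) for i, t in enumerate(tasks)}

# 3) A single backward pass through the whole network with the scaled losses.
total_loss = sum(scales[t] * criteria[t](heads[t](z), targets[t]) for t in tasks)
total_loss.backward()
```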
Hi @maria8899, @kilsenp, have you reproduced the results on MultiMNIST or CityScapes? Thanks.
Hi @liyangliu, have you reproduced the results on MultiMNIST? I have tried but only got results similar to grid search. Would you like to tell me the params you chose? Thanks.
@youandeme, I haven't reproduced the results on MultiMNIST. I used the same hyper-parameters mentioned by the author in https://github.com/intel-isl/MultiObjectiveOptimization/issues/9, but cannot surpass the uniform scaling baseline. Also, I noticed that in the "Gradient Surgery" paper (supplementary materials), other researchers report different results on MultiMNIST from this MOO paper. So I suspect that others also have difficulty reproducing the results on MultiMNIST following this MOO paper.
@liyangliu @youandeme How are you evaluating MultiMNIST? We did not release any test set; actually, there is no test set. The code generates a random test set every time you run it. For all models, you simply use the hyper-params I posted. Then, you save the result of every epoch and choose the epoch with the best validation accuracy. Then, you call the MultiMNIST test, which will generate a random test set and evaluate on it. If you call the MultiMNIST loader with the test param, it should do the trick. If you evaluate this way, the results will not match exactly since the test set is randomly generated, but the order of the methods is preserved.
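In code, the protocol described above might look something like this; the checkpoint path, the loader, and the evaluate() helper are placeholders, not the repo's actual API.

```python
# Hypothetical sketch of the evaluation protocol described above; the checkpoint
# naming, loader, and evaluate() helper are placeholders, not the repo's actual API.
import torch

best_epoch = max(val_accuracy, key=val_accuracy.get)  # dict: epoch -> validation accuracy
model.load_state_dict(torch.load("checkpoints/epoch_%d.pth" % best_epoch))

test_loader = get_multi_mnist_loader(split="test")  # regenerates a random test set each run
test_accuracy = evaluate(model, test_loader)
```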
Hi @ozansener, as you mentioned, the order of the different methods (MGDA-UB vs. uniform scaling) stays the same whatever test set I use. But on the validation set, I cannot see the superiority of MGDA-UB over uniform scaling. Also, on CityScapes I cannot reproduce the results reported in the paper. Actually, I find that the single-task baseline is better than the reported one (10.28 vs. 11.34 and 64.04 vs. 60.68 on the instance and semantic segmentation tasks, respectively). I obtained these numbers with your provided code, so maybe I made some mistakes?
@liyangliu For MultiMNIST, I think there are issues since we did not release a test set. Everyone reports slightly different numbers. In hindsight, we should have released the test set, but we did not even save it. So I would say please report whatever number you obtained for MultiMNIST. For Cityscapes, though, it is strange, as many people have reproduced the numbers. Please send me an e-mail about CityScapes so we can discuss.
Thanks, @ozansener. On CityScapes I re-ran your code with the instance & semantic segmentation tasks and got the following results for MGDA-UB and SINGLE task, respectively:
| method | instance | semantic |
|---|---|---|
| MGDA-UB | 15.88 | 64.53 |
| SINGLE | 10.28 | 64.04 |
| MGDA-UB (paper) | 10.25 | 66.63 |
| SINGLE (paper) | 11.34 | 60.08 |
It seems that the performance of instance segmentation is a bit strange.
@liyangliu The instance segmentation one looks strange. Are you using the hyper-params I posted for both single-task and multi-task? Also, are the tasks uniformly scaled or are you doing any search? Let me know the setup.
@ozansener, I used exactly the hyper-params you posted for single- & multi-task training. I used 1/0 and 0/1 scales for single-task training (instance and semantic segmentation) and didn't do any grid search.
Sorry for an off-topic question, but I have trouble even running the training on CityScapes: for a 256x512 input I get a 32x64 output, while the target is 256x512. The smaller output makes sense to me because of the non-dilated earlier layers & max pooling. So could someone please clarify whether the target should indeed have the same dimensions as the input, and if so, where the spatial upsampling is supposed to happen?