MultiObjectiveOptimization
Cityscapes experiment
Hi,
Thanks for open sourcing the code, this is great!
Could you share your json parameter file for cityscapes?
Also, I think it is missing the file depth_mean.npy, which is needed to run it.
Thanks.
We will update the code in a few days with depth_mean.npy and config files. But in the meantime, here are the config files if you do not want to wait:
- Parameter Set w/ Approximation: `optimizer=Adam|batch_size=8|lr=0.0005|dataset=cityscapes|normalization_type=none|algorithm=mgda|use_approximation=True`
- Parameter Set w/o Approximation: `optimizer=Adam|batch_size=8|lr=0.0001|dataset=cityscapes|normalization_type=none|algorithm=mgda|use_approximation=False`
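For anyone unsure how the pipe-separated strings above map onto the json parameter file asked about, here is a minimal sketch of the w/ Approximation setting as a Python dict; the key names simply mirror the parameter names and are assumptions, so check the repo's sample config for the actual JSON schema.

```python
# Hypothetical translation of the pipe-separated parameter string into config keys.
# Key names are assumptions that mirror the string; the repo's actual JSON schema may differ.
params_with_approx = {
    "optimizer": "Adam",
    "batch_size": 8,
    "lr": 0.0005,
    "dataset": "cityscapes",
    "normalization_type": "none",
    "algorithm": "mgda",
    "use_approximation": True,
}
```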
`depth_mean.npy` is the average depth map of the training set. We use it to make the input zero mean.
Thanks. Will the updated code run with pytorch 1.0? I am running into a few problems since some features are deprecated (e.g. volatile variables, `.data[0]`, etc.).
I am also having an error with the FW step:
sol, min_norm = MinNormSolver.find_min_norm_element([grads[t] for t in tasks])
----> 1 sol, min_norm = MinNormSolver.find_min_norm_element([grads[t] for t in tasks])
~/MultiObjectiveOptimization/min_norm_solvers.py in find_min_norm_element(vecs)
99 # Solution lying at the combination of two points
100 dps = {}
--> 101 init_sol, dps = MinNormSolver._min_norm_2d(vecs, dps)
102
103 n=len(vecs)
~/MultiObjectiveOptimization/min_norm_solvers.py in _min_norm_2d(vecs, dps)
42 dps[(i, j)] = 0.0
43 for k in range(len(vecs[i])):
---> 44 dps[(i,j)] += torch.dot(vecs[i][k], vecs[j][k]).data[0]
45 dps[(j, i)] = dps[(i, j)]
46 if (i,i) not in dps:
RuntimeError: dot: Expected 1-D argument self, but got 4-D
What exactly should `grads` and `[grads[t] for t in tasks]` contain?
Edit: the solution is to replace it with `torch.dot(vecs[i][k].view(-1), vecs[j][k].view(-1)).item()`
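In context, the patched accumulation from the traceback above would look roughly like the following; only the dot-product line changes, the surrounding lines follow the traceback, and the helper wrapper is purely for illustration.

```python
import torch

def accumulate_pairwise_dot(vecs, i, j, dps):
    # Mirrors the loop around min_norm_solvers.py line 44 from the traceback,
    # adapted for PyTorch >= 1.0: flatten the (possibly 4-D) gradient tensors
    # before torch.dot and use .item() instead of the removed .data[0] indexing.
    if (i, j) not in dps:
        dps[(i, j)] = 0.0
        for k in range(len(vecs[i])):
            dps[(i, j)] += torch.dot(vecs[i][k].view(-1), vecs[j][k].view(-1)).item()
        dps[(j, i)] = dps[(i, j)]
    return dps[(i, j)]
```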
I can confirm that those changes worked for me to get the code running with pytorch 1.0. I was able to reproduce the results for the single-task models, but so far no luck with the mgda method. Did you have to include any other changes, @maria8899?
@r0456230 Can you tell me what exactly you are trying to reproduce? The config files I put as a comment should give exact results of mgda w/ and w/o approximation.
Please note that we report the disparity metric in the paper and compute the depth metric in the code; the depth map is later converted into disparity separately as post-processing. If the issue is depth, this should explain it.
mIoU should be exactly the same as what is reported in the code and the paper. We used the parameters I posted as a comment.
@maria8899 Although we are planning to support pytorch 1.0, I am not sure when it will be. I will also update the ReadMe with the exact versions of each Python module we used. PyTorch was 0.3.1.
@ozansener I was able to reproduce the results from the paper for the single-task models using your code (depth, instance segmentation, and semantic segmentation on cityscapes). However, when I run the code with the parameters posted above, after 50 epochs the models seem to be far from the results obtained in the paper.
I think I have managed to make it work with pytorch 1.0, but I still need to check the results and train it fully. @r0456230 I haven't made many other changes; the main problem was in this FW step. Have you set up the scales/tasks correctly in the json file?
@maria8899 @r0456230 could you please tell me how to solve the missing depth_mean.npy problem?
I tried the code below, but I'm not sure if it's correct:
depth_mean = np.mean(depth[depth != 0])
depth[depth != 0] = (depth[depth != 0] - depth_mean) / self.DEPTH_STD
# depth[depth != 0] = (depth[depth != 0] - self.DEPTH_MEAN[depth != 0]) / self.DEPTH_STD
@YoungGer you need to compute a mean image (per pixel) using all training images (or just a few to get an approximation) of the Cityscapes's disparity dataset.
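A rough sketch of computing that per-pixel mean is below; the glob pattern assumes the standard Cityscapes disparity layout, and if your data loader resizes the inputs (e.g. to 256x512) you should compute the mean at that resolution instead.

```python
# Rough sketch: per-pixel mean over the Cityscapes training disparity maps.
# The directory layout below is an assumption about a standard Cityscapes download.
import glob
import numpy as np
from PIL import Image

files = glob.glob("cityscapes/disparity/train/*/*_disparity.png")
acc = None
for f in files:
    d = np.asarray(Image.open(f), dtype=np.float64)
    acc = d.copy() if acc is None else acc + d

depth_mean = acc / len(files)
np.save("depth_mean.npy", depth_mean)
```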
I know, thank you for your help.
@YoungGer Have you noticed that the find_min_norm_element method actually uses the projected gradient descent method? Only find_min_norm_element_FW is the Frank-Wolfe algorithm as discussed in the paper. They are only guaranteed to be equivalent when the number of tasks is 2.
EDIT: Obviously I realised right after sending that question 2 is because of the optimization. Question 1 still remains.
Hi @ozansener, thanks for publishing your code!
I have two questions after reading this answer by @maria8899:
> Edit: the solution is to replace it with `torch.dot(vecs[i][k].view(-1), vecs[j][k].view(-1)).item()`
In your code the z variable returned by the 'backbone' of the network is passed to each task. Its gradient is then used in the find_min_norm algorithm.
First of all, as maria noted, the gradient is a 4D variable that needs to be reshaped to 1D first in torch 1.0.1. I compared the behavior to torch 0.3.1 and it does lead to the same result, but it raised some questions, which may well come from my incomplete understanding of your paper.
- The gradient still has the batch dimension; why do you calculate the min_norm_point between all samples as one big vector instead of, for example, averaging or summing over the batch dimension? Isn't this effectively what happens after the reshaping? This is just intuitively speaking, comparing it with stochastic gradient descent.
- Why is there a batch dimension anyway? From the paper it is not quite clear to me what should be fed into the FrankWolfeSolver, but shouldn't it be the gradient of some parameters instead of an output variable? Or does that not matter and lead to the same result?
Thanks a lot!
@kilsenp First let me answer question 2.
- 2: You are right that if you apply MGDA directly, it should be gradients with respect to the parameters. However, one of the main contributions of the paper is showing that you can instead feed gradients with respect to the representations. This is basically Section 3.3 of the paper, and what we compute in the code is $\nabla_Z$.
- 1: No, you need the batch dimension since the forward pass of the network is different for each image. You can read Section 3.3 of the paper in detail to understand what's going on.
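To make this concrete, here is a minimal sketch of feeding per-task gradients with respect to the shared representation $Z$ into the min-norm solver. The names `encoder`, `heads`, `criteria`, `tasks`, `x`, and `targets` are placeholders, not the repo's actual training-loop variables.

```python
import torch
from min_norm_solvers import MinNormSolver  # solver from this repo

# Placeholders: `encoder` is the shared backbone, `heads[t]` the task-specific
# decoders, `criteria[t]` the per-task losses; x and targets come from the loader.
z = encoder(x)  # shared representation Z (the batch dimension is kept)

# 1) Per-task gradients w.r.t. Z, computed on a detached copy so only the task
#    head is traversed (the Section 3.3 shortcut instead of parameter gradients).
grads = {}
for t in tasks:
    z_t = z.detach().clone().requires_grad_(True)
    criteria[t](heads[t](z_t), targets[t]).backward()
    grads[t] = [z_t.grad.clone()]

# 2) The min-norm element over the gradients w.r.t. Z gives the task scales.
sol, _ = MinNormSolver.find_min_norm_element([grads[t] for t in tasks])
scales = {t: float(sol[i]) for i, t in enumerate(tasks)}

# 3) A single backward pass through the whole network with the scaled losses.
total_loss = sum(scales[t] * criteria[t](heads[t](z), targets[t]) for t in tasks)
total_loss.backward()
```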
Hi @maria8899, @kilsenp, have you reproduced the results on MultiMNIST or CityScapes? Thanks.
Hi @liyangliu, have you reproduced the results on MultiMNIST? I have tried but only got results similar to grid search. Would you like to tell me the params you chose? Thanks.
@youandeme, I haven't reproduced the results on MultiMNIST. I used the same hyper-parameters mentioned by the author in https://github.com/intel-isl/MultiObjectiveOptimization/issues/9, but cannot surpass the uniform scaling baseline. Also, I noticed that in the "Gradient Surgery" paper (supplementary materials), other researchers report different results on MultiMNIST from this MOO paper. So I suspect that others also have difficulty reproducing the results on MultiMNIST following this MOO paper.
@liyangliu @youandeme How are you evaluating MultiMNIST? We did not release any test set; actually, there is no test set. The code generates a random test set every time you run it. For all models, you simply use the hyper-params I posted. Then, you save the result of every epoch and choose the epoch with the best validation accuracy. Then, you call the MultiMNIST test, which will generate a random test set and evaluate on it. If you call the MultiMNIST loader with the test param, it should do the trick. If you evaluate this way, the results will not match exactly since the test set is randomly generated, but the order of the methods is preserved.
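In code, the protocol described above might look something like this; the checkpoint path, the loader, and the evaluate() helper are placeholders, not the repo's actual API.

```python
# Hypothetical sketch of the evaluation protocol described above; the checkpoint
# naming, loader, and evaluate() helper are placeholders, not the repo's actual API.
import torch

best_epoch = max(val_accuracy, key=val_accuracy.get)  # dict: epoch -> validation accuracy
model.load_state_dict(torch.load("checkpoints/epoch_%d.pth" % best_epoch))

test_loader = get_multi_mnist_loader(split="test")  # regenerates a random test set each run
test_accuracy = evaluate(model, test_loader)
```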
Hi @ozansener, as you mentioned, the order of the different methods (MGDA-UB vs. uniform scaling) stays the same whatever test set I use. But on the validation set, I cannot see the superiority of MGDA-UB over uniform scaling. Also, on CityScapes I cannot reproduce the results reported in the paper. Actually, I find that the single-task baseline is better than the reported one (10.28 vs. 11.34 and 64.04 vs. 60.68 on the instance and semantic segmentation tasks, respectively). I obtained these numbers with your provided code, so maybe I made some mistakes?
@liyangliu For MultiMNIST, I think there are issues since we did not release a test set. Everyone reports slightly different numbers. In hindsight, we should have released the test set, but we did not even save it. So I would say please report whatever number you obtained for MultiMNIST. For Cityscapes, though, it is strange, as many people have reproduced the numbers. Please send me an e-mail about CityScapes so we can discuss.
Thanks, @ozansener. On CityScapes I re-ran your code with the instance & semantic segmentation tasks and got the following results for MGDA-UB and SINGLE task, respectively:
| method | instance | semantic |
|---|---|---|
| MGDA-UB | 15.88 | 64.53 |
| SINGLE | 10.28 | 64.04 |
| MGDA-UB (paper) | 10.25 | 66.63 |
| SINGLE (paper) | 11.34 | 60.08 |
It seems that the performance of instance segmentation is a bit strange.
@liyangliu The instance segmentation one looks strange. Are you using the hyper-params I posted for both single-task and multi-task? Also, are the tasks uniformly scaled or are you doing any search? Let me know the setup.
@ozansener, I used exactly the hyper-params you posted for single- & multi-task training. I used 1/0 and 0/1 scales for single-task training (instance and semantic segmentation) and didn't do any grid search.
Sorry for an off-topic question, but I have trouble even running the training on CityScapes: for a 256x512 input I get a 32x64 output, while the target is 256x512. The smaller output makes sense to me because of the non-dilated earlier layers & max pooling. So could someone please clarify whether the target should indeed have the same dimensions as the input, and if so, where the spatial upsampling is supposed to happen?