
SmoothF2Loss

Open ahkarami opened this issue 6 years ago • 8 comments

Dear @mratsim, Thank you for your nice code. Have you ever tried SmoothF2Loss (in p2_metrics.py) in training? Does this loss function yield appropriate results? Just as another question, can we learn the threshold tensor during training (instead of optimizing it)?

ahkarami avatar Nov 09 '17 21:11 ahkarami

I tried SmoothF2Loss but it didn't help when I did. That was roughly one week into the two weeks I spent on the competition, and I didn't run exhaustive experiments like trying balanced_weights + SmoothF2Loss.
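For reference, a smooth F2 surrogate is typically built from soft true/false positive counts on the sigmoid probabilities; the sketch below illustrates the general idea and is not necessarily identical to the SmoothF2Loss in p2_metrics.py:

```python
import torch

def soft_fbeta_loss(logits, targets, beta=2.0, eps=1e-8):
    """Differentiable F-beta surrogate: hard 0/1 predictions are replaced
    by sigmoid probabilities so gradients can flow through the counts."""
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum(dim=0)          # soft true positives per class
    fp = (probs * (1 - targets)).sum(dim=0)    # soft false positives per class
    fn = ((1 - probs) * targets).sum(dim=0)    # soft false negatives per class
    b2 = beta ** 2
    fbeta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp + eps)
    return 1 - fbeta.mean()                    # minimize 1 - mean per-class F-beta
```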

Unfortunately, we can't learn the threshold because it's not differentiable.

The top solutions on Kaggle were around 0.933 and my solution around 0.92x. I believe the main bottlenecks were:

  • thresholding, for a truly integrated end-to-end learner.
  • making the network aware of label relationships (if you have clouds, you don't see anything; if you have a road, there is a greater chance of homes or agriculture).

I've collected a lot of papers regarding F-loss but did not find any way to integrate them. [screenshot: collection of F-loss papers]

So my ideas were:

  • Build an RNN in parallel with the CNN pipeline: it would embed the label vocabulary, be concatenated with the CNN output in a linear layer, and then predict a series of labels (a rough sketch of this variant follows the list).
  • Build an RNN after the CNN pipeline; this is a bit harder because you need a Keras-like "TimeDistributed" layer to feed the image to the network several times, and PyTorch did not offer that at the time, iirc.
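A rough sketch of the first (parallel) idea, with hypothetical module and parameter names, could look like this:

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNRNNParallel(nn.Module):
    """Hypothetical sketch: a GRU embeds the label vocabulary while a ResNet
    encodes the image; both representations are concatenated and fed to a
    linear classifier over the 17 Planet labels."""
    def __init__(self, n_labels=17, emb_dim=32, hidden_dim=64):
        super().__init__()
        cnn = models.resnet50(pretrained=True)
        cnn.fc = nn.Identity()                 # keep the 2048-d pooled features
        self.cnn = cnn
        self.label_emb = nn.Embedding(n_labels, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2048 + hidden_dim, n_labels)

    def forward(self, images, label_seq):
        img_feat = self.cnn(images)                     # (B, 2048)
        emb = self.label_emb(label_seq)                 # (B, T, emb_dim)
        _, h = self.rnn(emb)                            # h: (1, B, hidden_dim)
        fused = torch.cat([img_feat, h.squeeze(0)], dim=1)
        return self.classifier(fused)                   # raw logits, one per label
```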

mratsim avatar Nov 10 '17 08:11 mratsim

Dear @mratsim, Thank you for your complete answer. The winner of this Kaggle competition used a special soft F2 loss, and for modeling the label relationships he used a ridge regression model (to take advantage of label correlations): Understanding the Amazon from Space, 1st Place Winner's Interview. However, I don't know exactly how we can leverage these methods.

Sorry, one more question about your code. I have used your L-BFGS-B with Basinhopping optimization technique for finding the best threshold vector, but the computational cost is very heavy. I have a multi-label dataset with 80 classes, ~70,000 training images and ~35,000 validation images. I ran the L-BFGS-B with Basinhopping optimization code (with your parameters, except the number of classes set to 80 instead of 17), but after more than ~2 hours the optimization still isn't complete. Is the computational cost of this technique really that heavy for such a dataset?
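For context, a threshold search of this kind typically has roughly the following shape (a generic scipy sketch, not the repository's actual code; function and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import basinhopping
from sklearn.metrics import fbeta_score

def best_thresholds(probs, y_true, n_classes=80, niter=10):
    """probs, y_true: (n_samples, n_classes) arrays of predicted
    probabilities and 0/1 ground-truth labels."""
    def neg_f2(thresholds):
        preds = (probs > thresholds).astype(np.uint8)
        return -fbeta_score(y_true, preds, beta=2, average='samples')

    x0 = np.full(n_classes, 0.5)                 # start every class at 0.5
    res = basinhopping(neg_f2, x0, niter=niter,
                       minimizer_kwargs={'method': 'L-BFGS-B',
                                         'bounds': [(0.0, 1.0)] * n_classes})
    return res.x
```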

ahkarami avatar Nov 10 '17 10:11 ahkarami

Dear @mratsim, In addition to the notes above, I want to add one more. I have realized that in a multi-label classification problem (with a sigmoid layer at the end of the net) each sigmoid output is independent of the others; that is, each class probability is produced independently. As a result, because the L-BFGS-B with Basinhopping optimization is computationally expensive, we could use an exhaustive search for each class threshold independently. For example, with 80 class labels, we first initialize all thresholds to 0.5 (i.e., the threshold vector is [0.5] * 80). Then we search for the best value of the first threshold over the interval [0, 1] with, for example, a 0.05 step (i.e., exhaustively examine 0.05, 0.1, 0.15, 0.2, ..., 1) and select the value that maximizes the F2 score. Now we have (approximately) the best threshold for the first class. After that we repeat the same procedure for the other thresholds (a rough sketch follows below). What's your opinion of this exhaustive threshold optimization method? Is it better than the L-BFGS-B with Basinhopping optimization?
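A sketch of this greedy per-class grid search (function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import fbeta_score

def grid_search_thresholds(probs, y_true, step=0.05):
    """Greedy per-class grid search: optimize one threshold at a time
    while keeping the others fixed, as described above."""
    n_classes = probs.shape[1]
    thresholds = np.full(n_classes, 0.5)
    candidates = np.arange(step, 1.0 + 1e-9, step)   # 0.05, 0.10, ..., 1.0
    for c in range(n_classes):
        best_t, best_score = thresholds[c], -1.0
        for t in candidates:
            trial = thresholds.copy()
            trial[c] = t
            preds = (probs > trial).astype(np.uint8)
            score = fbeta_score(y_true, preds, beta=2, average='samples')
            if score > best_score:
                best_t, best_score = t, score
        thresholds[c] = best_t
    return thresholds
```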

ahkarami avatar Nov 10 '17 18:11 ahkarami

Very interesting read, I had missed that. I think the main take-away here is that ensembling still wins ;). I'm still convinced that a CNN+RNN to model label relationships would give an additional leap in score.

Regarding L-BFGS with Basinhopping, it is indeed extremely slow; the main issue is that scipy is single-threaded. PyTorch has an L-BFGS optimizer that I didn't try; it's probably multi-threaded and GPU-aware, so you may get an enormous speedup by using it.
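If one were to try torch.optim.LBFGS for this, note that it needs a differentiable objective and a closure that recomputes it, so the hard p > t thresholding would have to be replaced by a smooth surrogate. A minimal sketch under that assumption, with placeholder data:

```python
import torch

probs = torch.rand(1000, 80)               # placeholder predicted probabilities
y = (torch.rand(1000, 80) > 0.9).float()   # placeholder 0/1 labels
t = torch.full((80,), 0.5, requires_grad=True)
optimizer = torch.optim.LBFGS([t], lr=0.1, max_iter=50)

def closure():
    optimizer.zero_grad()
    soft_pred = torch.sigmoid((probs - t) * 50)   # smooth approximation of p > t
    tp = (soft_pred * y).sum(0)
    fp = (soft_pred * (1 - y)).sum(0)
    fn = ((1 - soft_pred) * y).sum(0)
    loss = 1 - (5 * tp / (5 * tp + 4 * fn + fp + 1e-8)).mean()  # 1 - soft F2
    loss.backward()
    return loss

optimizer.step(closure)
```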

Your Exhaustive Threshold Optimization Method has the same issue as GridSearch vs RandomSearch for hyperparameter optimization, see here and here.

[figure: grid search vs. random search parameter coverage]

If the score surface has rocky mountains and valleys, you will miss the optimal threshold. You need to at least use a local optimizer like L-BFGS to check whether there isn't a better threshold within ±0.01 of your exhaustive grid points.

That will be more exhaustive but also computationally more costly than Basinhopping (which is a global stochastic optimizer; global in the sense that, unlike L-BFGS, it will not get stuck in local minima).

i.e. embrace the randomness; it will get you good results with high efficiency (a typical example is Monte-Carlo Tree Search as used in board games, strategy games and robot competitions).

mratsim avatar Nov 10 '17 18:11 mratsim

Thank you very much for your nice and fantastic response.

ahkarami avatar Nov 10 '17 21:11 ahkarami

Dear @mratsim, You have mentioned that modeling the relationships between labels can further improve the classification results. I agree with you there, and you have proposed two fantastic ideas for that (in your first response). I have also found that you implemented some CNN+RNN models (e.g., GRU_ResNet50). Do your CNN+RNN models work via the main_pytorch.py script? If not, what is the issue and how can we address it?

ahkarami avatar Nov 15 '17 18:11 ahkarami

I started playing with it but then I lacked time as I worked on my other project (Arraymancer, a deep learning library in Nim: speed of C, ergonomics of Python).

I explored two different architectures: CNN then RNN in sequence, and CNN + RNN in parallel with concatenation into linear layers.

You can check it there:

mratsim avatar Nov 15 '17 21:11 mratsim

Wow!

By the way, do you have an example of a regression problem in Python (where the output value is continuous, not discrete)? Like estimating the location of something in an image, etc., say the Face Features competition on Kaggle.

All I found were examples of simple linear regression.
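For what it's worth, a regression setup in PyTorch is usually just a linear head producing continuous outputs trained with nn.MSELoss (or nn.SmoothL1Loss); a hypothetical minimal sketch with dummy data:

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch: predict 2 continuous coordinates (x, y) per image.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # no sigmoid/softmax: raw continuous outputs
criterion = nn.MSELoss()                        # or nn.SmoothL1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)            # dummy batch of images
targets = torch.rand(8, 2)                      # normalized (x, y) coordinates

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```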

P.S. I found this from "Feedback on PyTorch for Kaggle competitions".

RoyiAvital avatar Apr 22 '18 15:04 RoyiAvital