Amazon-Forest-Computer-Vision
SmoothF2Loss
Dear @mratsim,
Thank you for your nice code. Have you ever tried SmoothF2Loss
(in p2_metrics.py
) in training? Does this loss function yield appropriate results?
As another question, can we learn the threshold tensor
during training (instead of optimizing it separately)?
I tried to use SmoothF2Loss
but it didn't help when I did. That was about one week into the two weeks I worked on the competition, and I didn't run exhaustive experiments like trying balanced_weights + SmoothF2Loss.
Unfortunately, we can't learn the threshold
because it's not differentiable.
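For reference, the kind of smooth surrogate a loss like SmoothF2Loss stands for can be sketched as follows. This is my own minimal illustrative formulation, not the repo's exact code: the hard 0/1 predictions are replaced by sigmoid probabilities so the precision/recall counts become differentiable and usable as a training loss.

```python
import torch

def soft_f2_loss(logits, targets, eps=1e-8):
    # Soft counts: use sigmoid probabilities instead of hard 0/1
    # predictions so every term is differentiable w.r.t. the logits.
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum(dim=0)        # soft true positives, per class
    fp = (probs * (1 - targets)).sum(dim=0)  # soft false positives
    fn = ((1 - probs) * targets).sum(dim=0)  # soft false negatives
    # F-beta with beta = 2 (recall weighted 4x):
    # (1 + b^2)*tp / ((1 + b^2)*tp + b^2*fn + fp)
    f2 = 5 * tp / (5 * tp + 4 * fn + fp + eps)
    return 1 - f2.mean()  # minimize 1 - soft F2
```

Minimizing `1 - softF2` optimizes a differentiable proxy of the competition metric directly, instead of a generic cross-entropy.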
The top solutions on Kaggle were at 0.933-something and my solution at 0.92x-something. I believe that the main bottlenecks were:
- thresholding, for a truly integrated end-to-end learner.
- making the network aware of label relationships (if you have clouds, you don't see anything; if you have a road, there is a greater chance of homes or agriculture).
I've collected a lot of papers regarding F-Loss but did not find any way to integrate them.
So my ideas were:
- Build an RNN in parallel with the CNN pipeline: it would embed the labels vocabulary, be concatenated with the CNN output in a linear layer, and then predict a sequence of labels.
- Build an RNN after the CNN pipeline. This is a bit harder because you need something like Keras's "TimeDistributed" to repeat the image several times for the network, and PyTorch did not offer that at the time, IIRC.
Dear @mratsim,
Thank you for your complete answer. The winner of this Kaggle competition used a special soft F2 loss, and to model the label relationships he used a ridge regression model (to take advantage of label correlations).
Understanding the Amazon from Space, 1st Place Winner's Interview
However, I don't know exactly how we can leverage these methods.
Sorry, one more question about your code. I have used your implemented L-BFGS-B with Basinhopping
optimization technique to find the best threshold vector, but the computational cost is very heavy. I have a multi-label dataset with 80 classes, ~70,000 training images and ~35,000 validation images. I ran your L-BFGS-B with Basinhopping
optimization code (with your parameters, except the number of classes set to 80 instead of 17), but after more than ~2 hours the optimization still isn't complete. Is the computational cost of this technique inherently this heavy for a dataset of this size?
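For reference, here is a scaled-down sketch of the setup I mean (synthetic data and my own F2 helper standing in for the real pipeline, so names here are illustrative): every basinhopping iteration runs a full L-BFGS-B minimization, and every objective evaluation thresholds the entire validation matrix, so the cost grows with niter × (L-BFGS-B evaluations) × n_samples × n_classes.

```python
import numpy as np
from scipy.optimize import basinhopping

rng = np.random.default_rng(0)
n_samples, n_classes = 2000, 80          # small stand-in for the real data
probs = rng.random((n_samples, n_classes))
labels = (rng.random((n_samples, n_classes)) < 0.1).astype(float)

def f2_score(y_true, y_pred, eps=1e-9):
    tp = (y_pred * y_true).sum()
    fp = (y_pred * (1 - y_true)).sum()
    fn = ((1 - y_pred) * y_true).sum()
    return 5 * tp / (5 * tp + 4 * fn + fp + eps)

def neg_f2(thresholds):
    # One objective call = one full pass over all validation predictions.
    preds = (probs > thresholds).astype(float)
    return -f2_score(labels, preds)

res = basinhopping(neg_f2, x0=np.full(n_classes, 0.2), niter=5,
                   minimizer_kwargs={"method": "L-BFGS-B"})
best_thresholds = res.x
```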
Dear @mratsim,
In addition to the notes above, I want to add one more. I have figured out that in a multi-label classification problem (with a Sigmoid Layer
at the end of the net), each Sigmoid output is independent of the others; that is, each class probability is produced independently. As a result, since the L-BFGS-B with Basinhopping Optimization
is computationally expensive, we can use an exhaustive search for each class threshold independently. For example, with 80 class labels, we first initialize all thresholds to 0.5 (i.e., the threshold vector is set to [0.5] * 80). Then we search for the best value of the first threshold over the interval [0, 1] with, say, a 0.05 step (i.e., exhaustively examine 0.05, 0.1, 0.15, 0.2, ..., 0.95), and select the value that maximizes the F2 score. That gives approximately the best threshold for the first class. We then repeat the same procedure for the remaining thresholds. What's your opinion of this Exhaustive Threshold Optimization Method? Is it better than the L-BFGS-B with Basinhopping Optimization
?
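In code, the procedure I have in mind looks roughly like this (an illustrative sketch with a made-up F2 helper, not tied to the repo's code): fix all thresholds at 0.5, then sweep each class's threshold over a grid, keeping the value that maximizes the global F2 score.

```python
import numpy as np

def per_class_threshold_search(probs, labels, grid=np.arange(0.05, 1.0, 0.05)):
    """Coordinate-wise exhaustive search: one pass over the classes,
    O(n_classes * len(grid)) F2 evaluations in total."""
    n_classes = probs.shape[1]
    thresholds = np.full(n_classes, 0.5)

    def f2(th):
        pred = (probs > th).astype(float)
        tp = (pred * labels).sum()
        fp = (pred * (1 - labels)).sum()
        fn = ((1 - pred) * labels).sum()
        return 5 * tp / (5 * tp + 4 * fn + fp + 1e-9)

    for c in range(n_classes):
        best_t, best_score = thresholds[c], f2(thresholds)
        for t in grid:
            thresholds[c] = t
            score = f2(thresholds)
            if score > best_score:
                best_t, best_score = t, score
        thresholds[c] = best_t  # keep the grid value that maximized F2
    return thresholds
```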
Very interesting read, I missed that. I think the main take-away here is that ensembling still wins ;). I'm still convinced that a CNN+RNN to model label relationship would give an additional leap in score.
Regarding L-BFGS with Basinhopping
, it is indeed extremely slow; the main issue is that scipy is single-threaded. PyTorch does have an L-BFGS
optimizer that I didn't try; it's probably multi-threaded and GPU-aware, so you may be able to get an enormous speedup by using that.
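A hedged sketch of that idea: torch.optim.LBFGS needs a differentiable objective, so the hard `prob > threshold` step has to be replaced by a smooth surrogate, here a steep sigmoid. The surrogate and all names below are my assumptions, not existing repo code.

```python
import torch

torch.manual_seed(0)

def soft_f2(probs, labels, thresholds, tau=0.05, eps=1e-9):
    # Smooth "prob > threshold" with a steep sigmoid so the
    # objective is differentiable with respect to the thresholds.
    pred = torch.sigmoid((probs - thresholds) / tau)
    tp = (pred * labels).sum()
    fp = (pred * (1 - labels)).sum()
    fn = ((1 - pred) * labels).sum()
    return 5 * tp / (5 * tp + 4 * fn + fp + eps)

probs = torch.rand(1000, 17)                   # stand-in validation outputs
labels = (torch.rand(1000, 17) < 0.2).float()
thresholds = torch.full((17,), 0.5, requires_grad=True)
opt = torch.optim.LBFGS([thresholds], max_iter=50, line_search_fn="strong_wolfe")

def closure():
    opt.zero_grad()
    loss = -soft_f2(probs, labels, thresholds)  # maximize soft F2
    loss.backward()
    return loss

opt.step(closure)
```

Everything stays on torch tensors, so the same code should run on GPU by moving `probs`, `labels` and `thresholds` to CUDA.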
Your Exhaustive Threshold Optimization Method
has the same issue as GridSearch
vs RandomSearch
for hyperparameter optimization, see here and here.
If you have rocky mountains and valleys you will miss the optimal threshold. You need to at least use a local optimizer like L-BFGS
to check whether there isn't a better threshold within ±0.01 of your exhaustive points.
It will be more exhaustive but also more computationally costly than Basinhopping
(which is a global stochastic optimizer; global in the sense that, contrary to L-BFGS, it will not get stuck in local minima).
i.e., embrace the randomness; it will get you good results with high efficiency. (A typical example is Monte-Carlo Tree Search, as used in board games, strategy games and robot competitions.)
Thank you very much for your nice and fantastic response.
Dear @mratsim,
You have mentioned that modeling the relationships between labels can further improve the classification results. I agree with you there. And you have proposed two fantastic ideas for that (in your first response).
I have also found that you have implemented some CNN+RNN models (e.g., GRU_ResNet50
). Do your implemented CNN+RNN models work via the main_pytorch.py
script? If not, what's the problem and how can we address it?
I started playing with it but then I lacked time as I worked on my other project (Arraymancer, a deep learning library in Nim: speed of C, ergonomics of Python).
I explored 2 different architectures: CNN then RNN in sequence and CNN+RNN in parallel then concatenation into Linear layers.
You can check it there:
- Playing with Word Embeddings + RNN so that the net "speaks" label sequences that start with <BEGIN> and end with <STOP>.
- Word Embeddings + RNN with a CNN in parallel. I think I stopped there; I had trouble with the concatenation.
- I think I tried to build an attention model here, with an encoder and a decoder RNN, but didn't get very far.
Wow!
By the way, do you have an example of a regression problem in Python (where the output value is continuous, not discrete)? Like estimating the location of something on an image, etc. Something like the Face Features competition on Kaggle.
All I found is for simple Linear Regression.
P.S. I found this from Feedback on PyTorch for Kaggle competitions.