
New Feature: Label Distribution Smoothing (LDS) for Regression Targets

Open TKassis opened this issue 3 years ago • 8 comments

I wanted to submit a pull request to add this, but I'm not sure where to put it. I propose adding Label Distribution Smoothing (LDS), recently presented by Yang et al. at ICML 2021 in "Delving into Deep Imbalanced Regression". The project website, with code and a video talk, can be found here.

I propose adding a function prepare_weights that takes the regression labels as input and outputs corresponding per-sample weights. The weights can then be passed to the WeightedPerSampleLoss() callback. The function, taken from Yang's repo with slight modifications to accept float targets, would look like this:


import torch
import numpy as np
from scipy.ndimage import gaussian_filter1d, convolve1d
from scipy.signal.windows import triang

def get_lds_kernel_window(kernel, ks, sigma):
    """Return a symmetric smoothing kernel of size ks, normalized to peak at 1."""
    assert kernel in ['gaussian', 'triang', 'laplace']
    half_ks = (ks - 1) // 2

    if kernel == 'gaussian':
        base_kernel = [0.] * half_ks + [1.] + [0.] * half_ks
        kernel_window = gaussian_filter1d(base_kernel, sigma=sigma)
        kernel_window = kernel_window / kernel_window.max()
    elif kernel == 'triang':
        kernel_window = triang(ks)
    else:
        laplace = lambda x: np.exp(-abs(x) / sigma) / (2. * sigma)
        # build as an array so the normalization below is valid
        # (a plain Python list cannot be divided by a scalar)
        kernel_window = np.array([laplace(x) for x in np.arange(-half_ks, half_ks + 1)])
        kernel_window = kernel_window / kernel_window.max()

    return kernel_window
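# e.g. get_lds_kernel_window('gaussian', ks=5, sigma=2) returns 5 weights that
# peak at the center bin and decay symmetrically towards the edges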

def prepare_weights(labels, reweight, lds=True, lds_kernel='gaussian', lds_ks=5, lds_sigma=2):
    assert reweight in {'none', 'inverse', 'sqrt_inv'}
    if lds:
        assert reweight != 'none', "Set reweight to 'sqrt_inv' or 'inverse' when using LDS"

    # count occurrences of each label; the unique labels are sorted because the
    # LDS convolution below treats adjacent dict entries as neighboring values
    value_dict = {x: 0 for x in sorted(set(labels))}
    for label in labels:
        value_dict[label] += 1
    if reweight == 'sqrt_inv':
        value_dict = {k: np.sqrt(v) for k, v in value_dict.items()}
    elif reweight == 'inverse':
        value_dict = {k: np.clip(v, 5, 1000) for k, v in value_dict.items()}  # clip counts for inverse re-weighting
    num_per_label = [value_dict[label] for label in labels]
    if not len(num_per_label) or reweight == 'none':
        return None
    print(f"Using re-weighting: [{reweight.upper()}]")
    
    if lds:
        lds_kernel_window = get_lds_kernel_window(lds_kernel, lds_ks, lds_sigma)
        print(f'Using LDS: [{lds_kernel.upper()}] ({lds_ks}/{lds_sigma})')
        # smooth the (sorted) per-label counts with the kernel
        smoothed_value = convolve1d(
            np.asarray(list(value_dict.values())), weights=lds_kernel_window, mode='constant')
        # O(1) label -> position lookup instead of repeated list.index() calls
        label_idx = {label: i for i, label in enumerate(value_dict.keys())}
        num_per_label = [smoothed_value[label_idx[label]] for label in labels]

    weights = [np.float32(1 / x) for x in num_per_label]
    scaling = len(weights) / np.sum(weights)  # normalize so the weights average to 1
    weights = [scaling * x for x in weights]

    return torch.Tensor(weights)
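
As a minimal usage sketch (with toy data), the weights could be fed to the callback like this; it assumes WeightedPerSampleLoss takes the weight tensor as its first argument and that TSRegressor accepts callbacks via cbs, so please check the tsai docs for the exact signatures:

import numpy as np
from tsai.all import TSRegressor, WeightedPerSampleLoss

# toy data: 200 univariate series of length 50 with a skewed float target
X = np.random.randn(200, 1, 50).astype(np.float32)
y = np.random.exponential(scale=2.0, size=200).astype(np.float32)

weights = prepare_weights(y, reweight='sqrt_inv')  # one weight per sample
learn = TSRegressor(X, y, cbs=[WeightedPerSampleLoss(weights)])
learn.fit_one_cycle(5)

Note that with truly continuous targets most labels occur exactly once, so it is the smoothing over neighboring values that makes the weights informative.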

TKassis commented on Jul 15 '21

this is great, thank you!

vrodriguezf commented on Jul 15 '21

Hi @TKassis,

Thanks for creating this feature request. I've taken a look at the paper and I think it's great! I'd love to add this functionality to tsai asap. (I've actually just been working on a regression problem for a company where the target wasn't evenly distributed. This created some issues that this approach might be able to overcome.) I think it'd be good to add it to the library and create a tutorial notebook explaining how this new functionality works in conjunction with WeightedPerSampleLoss. It shouldn't be too difficult. Would any of you @TKassis or @vrodriguezf be interested in co-authoring such a notebook? (If not, don't worry. I'll create it anyway.)

oguiza commented on Jul 16 '21

Yes, I can help with that. I'll put together an example notebook.

TKassis commented on Jul 20 '21

Hi @TKassis,

> Yes, I can help with that. I'll put together an example notebook.

Wow, that'd be great! Thanks a lot for that!!

Please let me know how I can help you with the notebook and code. In the meantime, I'll review WeightedPerSampleLoss to try to make it simpler to use sample_weights.

oguiza commented on Jul 26 '21

Hi, I just want to provide an update on this long-standing issue (enhancement). I've reviewed the code @TKassis proposed and found a few issues with it. I've refactored it to make it usable with any type of regression label. For example, the code submitted above had some hard-coded values (5 & 1000) that are not applicable to every problem. I think using a number of bins makes it easier to use, and I've found an approach I believe works well. Based on that, I've decided to add both the get_lds_kernel_window and prepare_weights functions to tsai. I've also updated WeightedPerSampleLoss to work with this type of data (only the train loss will be modified). So if you're interested, feel free to test the new functionality. You'll find it here.
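
For readers curious what binning continuous targets might look like, here's a hypothetical sketch (bin_labels and n_bins are illustrative names, not tsai's actual implementation); the resulting integer bin indices would replace the raw floats in prepare_weights' counting step:

import numpy as np

def bin_labels(labels, n_bins=100):
    # map continuous targets to integer bin indices in [0, n_bins - 1]
    labels = np.asarray(labels, dtype=np.float32)
    edges = np.linspace(labels.min(), labels.max(), n_bins + 1)
    return np.clip(np.digitize(labels, edges) - 1, 0, n_bins - 1)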

oguiza commented on Nov 25 '21

This is exciting, Ignacio, thank you so much!!!

vrodriguezf commented on Nov 26 '21

Oh wow! Great work! I totally forgot about this notebook by the way!

TKassis commented on Nov 26 '21

Thanks, @vrodriguezf and @TKassis! No wonder you forgot about the notebook after what it took me to upload the code 😃 BTW, I just @-mentioned you to let you know the code is available in case you want to use it in the future, not to remind you of anything. I'll try to put something together soon. I'm interested in checking whether the approach really improves performance.

oguiza commented on Nov 26 '21