New Feature: Label Distribution Smoothing (LDS) for Regression Targets
I wanted to submit a pull request to add this, but I'm not sure where it should go. I propose adding Label Distribution Smoothing (LDS), recently presented by Yang et al. at ICML 2021. The project website can be found here, along with the code and a video talk.
I propose adding a prepare_weights function that takes the regression labels as input and outputs the corresponding weights. The weights can then be passed to the WeightedPerSampleLoss() callback. The function, taken from Yang's repo with slight modifications to accept float targets, would look like this:
import torch
import torch.nn.functional as F
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal.windows import triang
from scipy.ndimage import convolve1d

def get_lds_kernel_window(kernel, ks, sigma):
    "Build a symmetric smoothing kernel window of size `ks` ('gaussian', 'triang' or 'laplace')."
    assert kernel in ['gaussian', 'triang', 'laplace']
    half_ks = (ks - 1) // 2
    if kernel == 'gaussian':
        base_kernel = [0.] * half_ks + [1.] + [0.] * half_ks
        kernel_window = gaussian_filter1d(base_kernel, sigma=sigma) / max(gaussian_filter1d(base_kernel, sigma=sigma))
    elif kernel == 'triang':
        kernel_window = triang(ks)
    else:
        laplace = lambda x: np.exp(-abs(x) / sigma) / (2. * sigma)
        kernel_window = np.array(list(map(laplace, np.arange(-half_ks, half_ks + 1))))  # np.array so the normalization below works
        kernel_window = kernel_window / max(kernel_window)
    return kernel_window
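As a quick sanity check (purely illustrative, not part of the proposal), the returned window can be inspected directly; for the default settings it is a length-5 array whose peak is normalized to 1:

# Illustrative check of the kernel window
window = get_lds_kernel_window('gaussian', ks=5, sigma=2)
print(window.shape)  # (5,)
print(window.max())  # 1.0, since the window is divided by its maximum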

def prepare_weights(labels, reweight, lds=True, lds_kernel='gaussian', lds_ks=5, lds_sigma=2):
    "Compute per-sample weights from regression `labels` via inverse-frequency re-weighting, optionally smoothed with LDS."
    assert reweight in {'none', 'inverse', 'sqrt_inv'}
    assert reweight != 'none' if lds else True, \
        "Set reweight to 'sqrt_inv' (default) or 'inverse' when using LDS"

    value_dict = {x: 0 for x in sorted(set(labels))}  # keys sorted so the convolution below smooths neighboring label values
    for label in labels:
        value_dict[label] += 1  # count occurrences of each label
    if reweight == 'sqrt_inv':
        value_dict = {k: np.sqrt(v) for k, v in value_dict.items()}
    elif reweight == 'inverse':
        value_dict = {k: np.clip(v, 5, 1000) for k, v in value_dict.items()}  # clip counts for inverse re-weighting
    num_per_label = [value_dict[label] for label in labels]
    if not len(num_per_label) or reweight == 'none':
        return None
    print(f"Using re-weighting: [{reweight.upper()}]")

    if lds:
        lds_kernel_window = get_lds_kernel_window(lds_kernel, lds_ks, lds_sigma)
        print(f'Using LDS: [{lds_kernel.upper()}] ({lds_ks}/{lds_sigma})')
        # smooth the empirical label distribution by convolving it with the kernel window
        smoothed_value = convolve1d(
            np.asarray([v for _, v in value_dict.items()]), weights=lds_kernel_window, mode='constant')
        value_dict_keys = list(value_dict.keys())
        num_per_label = [smoothed_value[value_dict_keys.index(label)] for label in labels]

    weights = [np.float32(1 / x) for x in num_per_label]
    scaling = len(weights) / np.sum(weights)  # rescale so the weights sum to the number of samples
    weights = [scaling * x for x in weights]
    return torch.Tensor(weights)
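Here is a minimal sketch of how this could be used end to end (illustrative only; the synthetic targets and the exact WeightedPerSampleLoss argument are assumptions and should be checked against the tsai source):

# Illustrative sketch with synthetic, skewed regression targets (not part of the proposal)
rng = np.random.default_rng(0)
y_train = np.round(rng.exponential(scale=2.0, size=1000), 1)  # imbalanced float targets

weights = prepare_weights(y_train.tolist(), reweight='sqrt_inv', lds=True, lds_kernel='gaussian', lds_ks=5, lds_sigma=2)
print(weights.shape, float(weights.sum()))  # one weight per sample; weights sum to ~len(y_train)

# The weights would then be passed to the WeightedPerSampleLoss callback when building the learner, e.g.:
# learn = ts_learner(dls, cbs=[WeightedPerSampleLoss(weights)])  # argument assumed, check the tsai docs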
This is great, thank you!
Hi @TKassis,
Thanks for creating this feature request.
I've taken a look at the paper and I think it's great! I'd love to add this functionality to tsai asap. (I've actually just been working on a regression problem for a company where the target wasn't evenly distributed. This created some issues that this approach might be able to overcome.)
I think it'd be good to add it to the library and create a tutorial notebook explaining how this new functionality works in conjunction with WeightedPerSampleLoss. It shouldn't be too difficult.
Would any of you @TKassis or @vrodriguezf be interested in co-authoring such a notebook? (If not, don't worry. I'll create it anyway.)
Yes, I can help with that. I'll put together an example notebook.
Hi @TKassis,
> Yes, I can help with that. I'll put together an example notebook.
Wow, that'd be great! Thanks a lot for that!!
Please let me know how I can help you with the notebook and code.
In the meantime, I'll review WeightedPerSampleLoss to try and make it simpler to use sample_weights.
Hi, I just want to provide an update on this long-standing issue (enhancement). I've reviewed the code @TKassis proposed and found a few issues with it. I've refactored it to ensure it is usable with any type of regression label. For example, the code submitted above has some hard-coded values (5 & 1000) that are not applicable to every problem; I think that using a number of bins makes it easier to use, and I've found an approach I believe works well. Based on that, I've decided to add both the get_lds_kernel_window and prepare_weights functions to tsai. I've also updated WeightedPerSampleLoss to work with this type of data (only the train loss will be modified). So if you are interested, feel free to test the new functionality. You'll find it here.
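To illustrate the bin-based idea, here's a rough sketch (simplified, not the exact implementation in tsai, and the helper name bin_counts is just for illustration) of how continuous targets could be digitized into ordered bins before counting and smoothing:

import numpy as np

# Simplified sketch of the bin-based idea: map continuous targets to `n_bins` ordered bins,
# count samples per bin, then smooth the per-bin counts with the LDS kernel and weight each
# sample by the inverse of its (smoothed) bin count.
def bin_counts(labels, n_bins=100):
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(labels.min(), labels.max(), n_bins + 1)
    bin_idx = np.digitize(labels, edges[1:-1])        # bin index per sample, in [0, n_bins - 1]
    counts = np.bincount(bin_idx, minlength=n_bins)   # empirical label distribution over the bins
    return bin_idx, counts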
This is exciting Ignacio, thank you so much!!!
Oh wow! Great work! I totally forgot about this notebook by the way!
Thanks, @vrodriguezf and @TKassis! No wonder you forgot about the notebook after how long it took me to upload the code 😃 BTW, I just @-mentioned you to let you know the code is available in case you want to use it in the future, not to remind you of anything. I'll try to put something together soon. I'm interested in checking whether the approach really improves performance.