setfit hyperparameters to control how to handle long documents

Open turian opened this issue 2 years ago • 0 comments

It's common to want to use setfit to classify documents that are longer than max_token_len.

There are several strategies for handling long documents, and the efficacy of each is data dependent:

Break the document up at max_token_len, possibly taking care not to split in the middle of a word.
Optionally use a sliding window, so that consecutive windows overlap.
Keep all the windows, or the first k windows, or something fancier like finding the most "interesting" windows with respect to the overall corpus.
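
To make the windowing options concrete, here is a minimal sketch in plain Python. It is not setfit code: `split_into_windows`, `stride`, and `first_k` are hypothetical names picked for illustration, and whitespace splitting stands in for a real tokenizer (which also means a window never cuts a word in half).

```python
from typing import List, Optional, Sequence


def split_into_windows(
    tokens: Sequence[str],
    max_token_len: int,
    stride: Optional[int] = None,
    first_k: Optional[int] = None,
) -> List[List[str]]:
    """Split a token sequence into windows of at most max_token_len tokens.

    stride:  step between window starts. Defaults to max_token_len (no overlap);
             a smaller value gives an overlapping sliding window.
    first_k: if set, keep only the first k windows.
    """
    if max_token_len <= 0:
        raise ValueError("max_token_len must be positive")
    if stride is None:
        stride = max_token_len
    if not 0 < stride <= max_token_len:
        raise ValueError("stride must be in the range (0, max_token_len]")

    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + max_token_len]))
        if start + max_token_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows if first_k is None else windows[:first_k]


# Whitespace tokens are an assumption; a real setup would use the model's tokenizer.
doc = "one two three four five six seven".split()
print(split_into_windows(doc, max_token_len=3))
print(split_into_windows(doc, max_token_len=3, stride=2, first_k=2))
```

Picking the most "interesting" windows would replace the final `first_k` slice with some scoring step, which is left out here because it depends on the corpus.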

Then after embedding each window, different classification strategies are possible:

maxpool then predict
average then predict
predict then average
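
The three options differ only in where the pooling happens relative to the classifier. A rough sketch, assuming each window has already been embedded into a plain list of floats; `classify_document`, the strategy strings, and `toy_predict_proba` are hypothetical names for illustration and are not part of setfit:

```python
from typing import Callable, List, Sequence

Vector = List[float]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum over equal-length vectors."""
    return [max(dim) for dim in zip(*vectors)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean over equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def classify_document(
    window_embeddings: Sequence[Vector],
    predict_proba: Callable[[Vector], Vector],
    strategy: str = "average_then_predict",
) -> Vector:
    """Return class probabilities for one document given its window embeddings."""
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    if strategy == "maxpool_then_predict":
        return predict_proba(max_pool(window_embeddings))
    if strategy == "average_then_predict":
        return predict_proba(mean_pool(window_embeddings))
    if strategy == "predict_then_average":
        return mean_pool([predict_proba(e) for e in window_embeddings])
    raise ValueError(f"unknown strategy: {strategy}")


def toy_predict_proba(embedding: Vector) -> Vector:
    """Stand-in classifier: class 1 if the first dimension exceeds 0.5."""
    positive = 1.0 if embedding[0] > 0.5 else 0.0
    return [1.0 - positive, positive]


# Made-up embeddings for three windows of one document.
windows = [[0.0, 0.0], [0.25, 1.0], [1.0, 0.5]]
for name in ("maxpool_then_predict", "average_then_predict", "predict_then_average"):
    print(name, classify_document(windows, toy_predict_proba, strategy=name))
```

Even on this toy input the three strategies give three different answers, which is the reason to treat the choice as a hyperparameter rather than fix it in advance.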

It would be great if these approaches could be hyperparameters for validation + test.

For training, it might be easiest to require that every training document fits within max_token_len; alternatively, the strategies above could be applied there too.

Related: https://github.com/UKPLab/sentence-transformers/issues/1673 https://github.com/UKPLab/sentence-transformers/issues/1333 https://github.com/UKPLab/sentence-transformers/issues/1166

Jul 21 '23 11:07 turian
