setfit

hyperparameters to control how to handle long documents

Open turian opened this issue 2 years ago • 0 comments

It's common that one might want to use setfit for classifying documents that are longer than max_token_len.

There are several strategies for handling long documents, and the efficacy of each is data dependent:

  • Break the document up at max_token_len, ideally without splitting in the middle of a word.
  • Optionally use a sliding window, so that consecutive windows overlap.
  • Keep all the windows, or only the first k windows, or something fancier like finding the most "interesting" windows with respect to the overall corpus.
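
To make the windowing options above concrete, here is a minimal sketch. It assumes the document is already a list of tokens (plain whitespace splitting stands in for the model's real tokenizer), and the function name `split_into_windows` and the parameters `stride` and `first_k` are hypothetical, not part of setfit.

```python
from typing import List, Optional, Sequence


def split_into_windows(
    tokens: Sequence[str],
    max_len: int,
    stride: Optional[int] = None,
    first_k: Optional[int] = None,
) -> List[List[str]]:
    """Split a token sequence into windows of at most max_len tokens.

    Hypothetical helper, not a setfit API.
    stride=None  -> non-overlapping windows (step equals max_len).
    stride < max_len -> sliding window; consecutive windows overlap.
    first_k=None -> keep all windows, otherwise only the first k.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    step = max_len if stride is None else stride
    if not 0 < step <= max_len:
        raise ValueError("stride must be in the range 1..max_len")

    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + max_len]))
        # Stop once the current window already reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += step

    if first_k is not None:
        windows = windows[:first_k]
    return windows


# Whitespace tokens are an assumption; a real setup would count model tokens.
tokens = "a b c d e f g".split()
plain = split_into_windows(tokens, max_len=3)
sliding = split_into_windows(tokens, max_len=3, stride=2)
first_two = split_into_windows(tokens, max_len=3, stride=2, first_k=2)
```

Because windows are cut on token boundaries, words are never split here; with a subword tokenizer one would have to cut on word boundaries explicitly.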

Then after embedding each window, different classification strategies are possible:

  • maxpool then predict
  • average then predict
  • predict then average
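
The three strategies above could look roughly like this. This is only a sketch: embeddings are plain lists of floats, `predict_proba` is a hypothetical callable that maps one embedding to class probabilities (standing in for the classification head), and `classify_document` and the strategy names are made up for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean over a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def classify_document(
    window_embeddings: Sequence[Vector],
    predict_proba: Callable[[Vector], Vector],
    strategy: str = "average_then_predict",
) -> Vector:
    """Combine per-window embeddings into one document-level prediction.

    Hypothetical helper, not a setfit API. predict_proba is assumed to take
    a single embedding and return a list of class probabilities.
    """
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    if strategy == "maxpool_then_predict":
        return predict_proba(max_pool(window_embeddings))
    if strategy == "average_then_predict":
        return predict_proba(mean_pool(window_embeddings))
    if strategy == "predict_then_average":
        # Classify every window separately, then average the probabilities.
        return mean_pool([predict_proba(e) for e in window_embeddings])
    raise ValueError(f"unknown strategy: {strategy}")
```

The first two strategies call the classifier once per document; the third calls it once per window, so it costs more at inference time but lets each window vote.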

It would be great if these approaches could be hyperparameters for validation + test.

For training, it might be easiest to require that every training document fits within max_token_len; alternatively, the above strategies could be used there too.

Related: https://github.com/UKPLab/sentence-transformers/issues/1673 https://github.com/UKPLab/sentence-transformers/issues/1333 https://github.com/UKPLab/sentence-transformers/issues/1166

turian, Jul 21 '23 11:07