setfit
Hyperparameters to control how to handle long documents
It's common to want to use SetFit to classify documents that are longer than max_token_len.
There are several strategies for handling long documents, and the efficacy of each is data dependent:
- Break the document up into windows of max_token_len tokens, possibly avoiding splits in the middle of a word.
- Optionally use a sliding window, so that consecutive windows overlap.
- Keep all the windows, or only the first k windows, or do something fancier like picking the most "interesting" windows with respect to the overall corpus.
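To make the windowing options concrete, here is a minimal sketch in plain Python. It is not part of SetFit or sentence-transformers: the function name `split_into_windows` and the parameters `stride` and `first_k` are hypothetical, and `tokens` stands in for whatever the model's tokenizer produces. It does not cover the "most interesting windows" idea.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def split_into_windows(
    tokens: Sequence[T],
    max_token_len: int,
    stride: Optional[int] = None,
    first_k: Optional[int] = None,
) -> List[List[T]]:
    """Split a token sequence into windows of at most max_token_len tokens.

    stride:  step between window starts. None means non-overlapping windows;
             a value smaller than max_token_len gives a sliding window.
    first_k: if set, keep only the first k windows.
    (Hypothetical helper for illustration, not an existing SetFit function.)
    """
    if max_token_len <= 0:
        raise ValueError("max_token_len must be positive")
    step = max_token_len if stride is None else stride
    if not 0 < step <= max_token_len:
        raise ValueError("stride must be in (0, max_token_len]")
    if len(tokens) == 0:
        return []

    windows: List[List[T]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_token_len]))
        # Stop once the current window reaches the end of the document.
        if start + max_token_len >= len(tokens):
            break
        start += step

    if first_k is not None:
        windows = windows[:first_k]
    return windows
```

For example, `split_into_windows(tokens, 256, stride=128, first_k=4)` would keep the first four half-overlapping windows. Splitting on whole tokens already avoids cutting a token in half; avoiding a split inside a word that spans several subword tokens would need extra logic that this sketch leaves out.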
Then after embedding each window, different classification strategies are possible:
- maxpool then predict
- average then predict
- predict then average
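The three strategies differ only in where the pooling happens relative to the classifier. The sketch below uses plain Python lists to show this; the function names are hypothetical, and `classifier` is assumed to be any callable that maps one embedding to a list of class probabilities (it is not SetFit's actual head API). `toy_classifier` is a made-up stand-in so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Assumed interface: one embedding in, class probabilities out.
Classifier = Callable[[Sequence[float]], Vector]


def _pool(vectors: Sequence[Sequence[float]], reduce_fn) -> Vector:
    """Apply reduce_fn across windows, one embedding dimension at a time."""
    if not vectors:
        raise ValueError("need at least one window")
    return [reduce_fn(column) for column in zip(*vectors)]


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def maxpool_then_predict(window_embeddings, classifier: Classifier) -> Vector:
    # Element-wise max over window embeddings, then a single prediction.
    return classifier(_pool(window_embeddings, max))


def average_then_predict(window_embeddings, classifier: Classifier) -> Vector:
    # Element-wise mean over window embeddings, then a single prediction.
    return classifier(_pool(window_embeddings, _mean))


def predict_then_average(window_embeddings, classifier: Classifier) -> Vector:
    # One prediction per window, then average the class probabilities.
    per_window = [classifier(e) for e in window_embeddings]
    return _pool(per_window, _mean)


def toy_classifier(embedding: Sequence[float]) -> Vector:
    """Made-up binary classifier for illustration only."""
    p = 1.0 / (1.0 + math.exp(-(embedding[0] - embedding[1])))
    return [1.0 - p, p]


windows = [[3.0, 0.0], [0.0, 1.0]]  # two window embeddings of dimension 2
for strategy in (maxpool_then_predict, average_then_predict, predict_then_average):
    print(strategy.__name__, strategy(windows, toy_classifier))
```

With a non-linear classifier the three strategies generally give different probabilities for the same windows, which is why exposing the choice as a hyperparameter could be useful.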
It would be great if these approaches could be hyperparameters for validation + test.
For training, it might be easiest to insist that every training document fits within max_token_len; alternatively, the above strategies could be used there too.
Related: https://github.com/UKPLab/sentence-transformers/issues/1673 https://github.com/UKPLab/sentence-transformers/issues/1333 https://github.com/UKPLab/sentence-transformers/issues/1166