[Fix] Retriever tokenization function in atlas.py needs correction
When the code runs, the maximum passage length becomes the smaller of the two options, `self.opt.text_maxlength` and `gpu_embedder_batch_size`. By default, `gpu_embedder_batch_size` is set to 512, so if you run the code with the default options, most BERT-style dual encoders will work without issues (see line 74). However, if you reduce `gpu_embedder_batch_size` to conserve GPU memory, the maximum passage length silently shrinks along with it, and unexpected results can occur without warning.
https://github.com/facebookresearch/atlas/blob/f8bec5c6024eeee5315ec25322942e5f62ab0eb8/src/atlas.py#L61-L89
So, it is recommended to modify line 74 as follows (as done in other parts of the code):

`min(self.opt.text_maxlength, BERT_MAX_SEQ_LENGTH),`
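To make the failure mode concrete, here is a minimal standalone sketch. The function `tokenize_passages` and the list-slicing truncation are illustrative stand-ins for the actual atlas.py code, but the `min(...)` coupling mirrors the problematic line 74:

```python
def tokenize_passages(passages, text_maxlength, gpu_embedder_batch_size):
    # Buggy coupling: the effective max length is the min of the intended
    # text_maxlength and an unrelated batch-size option.
    max_length = min(text_maxlength, gpu_embedder_batch_size)
    return [p[:max_length] for p in passages]  # stand-in for real truncation

passage = list(range(512))  # a passage of 512 "tokens"

# With the default gpu_embedder_batch_size=512, nothing looks wrong:
ok = tokenize_passages([passage], text_maxlength=200,
                       gpu_embedder_batch_size=512)
assert len(ok[0]) == 200

# Shrinking the batch size to save GPU memory silently truncates every
# passage to 64 tokens, even though text_maxlength is still 200:
bad = tokenize_passages([passage], text_maxlength=200,
                        gpu_embedder_batch_size=64)
assert len(bad[0]) == 64

# The proposed fix caps by the encoder's max sequence length instead,
# which is independent of the batch size:
BERT_MAX_SEQ_LENGTH = 512
assert min(200, BERT_MAX_SEQ_LENGTH) == 200
```

With the fix, lowering `gpu_embedder_batch_size` only affects how many passages are embedded per forward pass, not how long each passage is allowed to be.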
Indeed, I saw this in the code and wondered what the logic behind it was, if any! It would be useful if the authors could clarify!