[Fix] Retriever tokenization function in atlas.py needs correction
When the code runs, the maximum passage length becomes the smaller of the two options, `self.opt.text_maxlength` and `gpu_embedder_batch_size`. By default, `gpu_embedder_batch_size` is set to 512, so if you run the code with the default options, most BERT-style dual encoders will work without issues (see line 74). However, if you reduce `gpu_embedder_batch_size` to conserve GPU memory, the maximum passage length silently shrinks along with it, and unexpected results can occur without warning.
https://github.com/facebookresearch/atlas/blob/f8bec5c6024eeee5315ec25322942e5f62ab0eb8/src/atlas.py#L61-L89
So, it is recommended to modify line 74 as follows (as done in other parts of the code):

`min(self.opt.text_maxlength, BERT_MAX_SEQ_LENGTH),`
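To make the failure mode concrete, here is a minimal standalone sketch. The function `tokenize_passages` and the list-slicing truncation are illustrative stand-ins for the actual atlas.py code, but the `min(...)` coupling mirrors the problematic line 74:

```python
def tokenize_passages(passages, text_maxlength, gpu_embedder_batch_size):
    # Buggy coupling: the effective max length is the min of the intended
    # text_maxlength and an unrelated batch-size option.
    max_length = min(text_maxlength, gpu_embedder_batch_size)
    return [p[:max_length] for p in passages]  # stand-in for real truncation

passage = list(range(512))  # a passage of 512 "tokens"

# With the default gpu_embedder_batch_size=512, nothing looks wrong:
ok = tokenize_passages([passage], text_maxlength=200,
                       gpu_embedder_batch_size=512)
assert len(ok[0]) == 200

# Shrinking the batch size to save GPU memory silently truncates every
# passage to 64 tokens, even though text_maxlength is still 200:
bad = tokenize_passages([passage], text_maxlength=200,
                        gpu_embedder_batch_size=64)
assert len(bad[0]) == 64

# The proposed fix caps by the encoder's max sequence length instead,
# which is independent of the batch size:
BERT_MAX_SEQ_LENGTH = 512
assert min(200, BERT_MAX_SEQ_LENGTH) == 200
```

With the fix, lowering `gpu_embedder_batch_size` only affects how many passages are embedded per forward pass, not how long each passage is allowed to be.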
Indeed, I saw this in the code and wondered what the logic behind it was, if any! It would be useful if the authors could clarify!