Liang Wang
The v2 models are pre-trained on larger text pair datasets; the network architecture and training recipes are the same.
Yes, currently it only works for English. We'll release multilingual versions of the text embeddings in the coming months (no guarantee about the timeline, though), so please stay tuned! Thanks, Liang
* For BM25, we sample 200 negatives from the top-1000 retrieved passages.
* For Retriever 2, we use the top-200 samples from retriever 1 as hard negatives.
* For the Reranker, again,...
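A rough sketch of the sampling scheme described above (my reconstruction for illustration, not the released code; whether the retriever-2 stage samples from or takes all of the top-200 is my assumption):

```python
import random

def sample_negatives(ranked_ids, positive_ids, depth, num_negatives):
    """Sample negatives from the top-`depth` retrieved ids, excluding known positives."""
    pool = [pid for pid in ranked_ids[:depth] if pid not in positive_ids]
    return random.sample(pool, min(num_negatives, len(pool)))

# Toy ranked lists standing in for real retrieval output.
bm25_ranked = list(range(1000))
positives = {3, 17}

# BM25 stage: sample 200 negatives from the top-1000 retrieved passages.
bm25_negs = sample_negatives(bm25_ranked, positives, depth=1000, num_negatives=200)

# Retriever-2 stage: hard negatives come from retriever 1's top-200.
r1_ranked = list(range(1000))
r2_negs = sample_negatives(r1_ranked, positives, depth=200, num_negatives=200)
```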
If you want to use the released model without fine-tuning, you should add the "query: " and "passage: " prefixes; otherwise they are optional.
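A minimal usage sketch (my illustration, assuming the `intfloat/e5-base` checkpoint and mean pooling; adjust to the model you actually use):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

texts = [
    "query: how do neural text embeddings work",           # queries get "query: "
    "passage: Text embeddings map sentences to vectors.",  # documents get "passage: "
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, dim=-1)

score = emb[0] @ emb[1]  # cosine similarity between query and passage
```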
> one last question, is there any plan to release also the cross encoder? @intfloat The cross encoder for ms-marco is available at [https://github.com/microsoft/unilm/tree/master/simlm#available-models](https://github.com/microsoft/unilm/tree/master/simlm#available-models), we do not plan to release...
Can you try adding `--label_names labels` to the launch command in `simlm/scripts/train_biencoder_marco.sh`? Our code base is tested with `transformers==4.15`; newer versions seem to have breaking changes for us. UPDATE: this...
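For context, a minimal illustration of what that flag maps to on the Python side of Hugging Face `TrainingArguments` (my sketch, not the SimLM code):

```python
# Sketch: newer transformers versions let the Trainer infer which input keys
# are labels; setting label_names explicitly keeps a custom "labels" field
# from being mishandled during training/evaluation.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    label_names=["labels"],  # the Python-side equivalent of --label_names labels
)
```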
You can refer to the answer here: https://github.com/intfloat/SimKGC/issues/10#issuecomment-1296792284 The InfoNCE loss is basically a cross-entropy loss, except that the labels are not pre-defined as in text classification.
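A minimal sketch of InfoNCE with in-batch negatives (my illustration, not the exact SimKGC code; the temperature value is a placeholder):

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, cand_emb, temperature=0.05):
    # Row i of cand_emb is the positive for row i of query_emb;
    # every other row in the batch acts as a negative.
    logits = query_emb @ cand_emb.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)  # "label" = index of the diagonal positive
```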
Sure, just comment out the second line `loss += ...`; performance will drop a bit though.
> > Sure, just comment out the second line `loss += ...`; performance will drop a bit though.
>
> After commenting out the second `loss += ...` line during training, do I also need to comment out `backward_metrics = ...` during testing?

No need. The forward metrics correspond to predicting the tail entity given the head entity and relation; the backward metrics correspond to predicting the head entity given the tail entity and relation. For most datasets, the first prediction task is easier, so its metrics are better.
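To make the two directions concrete, a toy illustration (my own, not the SimKGC code; handling the backward case via an inverse relation is an assumption here):

```python
triples = [("Paris", "capital_of", "France")]

# forward: given (head, relation), rank candidate tails -> forward metrics
forward_queries = [((h, r), t) for h, r, t in triples]
# backward: given (tail, inverse relation), rank candidate heads -> backward metrics
backward_queries = [((t, "inverse of " + r), h) for h, r, t in triples]
```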
I am not entirely clear about your question, but I guess you are asking whether the ground-truth tail entity will be leaked in the input during the test stage? The link...