FlagEmbedding
Retrieval and Retrieval-augmented LLMs
Environment: NVIDIA RTX 5090, CUDA 12.8, Python 3.11.7, torch 2.7.1+cu128

```
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/miniconda3/envs/emb-ft/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2506, in _run_ninja_build
[rank0]:     subprocess.run(
[rank0]:   File...
```
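For reference, a minimal sketch of pinning the build target as the warning suggests; the `"12.0"` value assumes the RTX 5090's compute capability and should be verified with `torch.cuda.get_device_capability()` on your machine. For a `torchrun` launch, the equivalent is exporting `TORCH_CUDA_ARCH_LIST` in the shell before starting the run.

```python
# Sketch only: pin the CUDA architecture list used when torch compiles extensions.
# Assumption: the RTX 5090 reports compute capability 12.0; check this locally.
import os

os.environ["TORCH_CUDA_ARCH_LIST"] = "12.0"  # must be set before any extension is built

import torch

print(torch.cuda.get_device_capability())  # expected (12, 0) on a 5090
```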
### Fine-tuning script

```
torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path models/bge-m3 \
    --cache_dir ./cache/model \
    --train_data ./data/bge-emb.jsonl \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    ...
```
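In case it helps reproduce the setup, here is a minimal sketch of what one line of `./data/bge-emb.jsonl` might look like, assuming the usual FlagEmbedding embedder fine-tuning layout of a `query` with `pos`/`neg` passage lists; the example texts are placeholders.

```python
import json

# One training record per JSONL line: a query, positive passages, and hard-negative passages.
# The strings below are illustrative only.
record = {
    "query": "what is dense retrieval?",
    "pos": ["Dense retrieval encodes queries and passages into vectors and matches them by similarity."],
    "neg": ["Sparse retrieval relies on exact term matching such as BM25."],
}

with open("./data/bge-emb.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```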
Hi, thanks for sharing this great work on CodeR: Towards A Generalist Code Embedding Model! I noticed that the repository currently provides datasets and evaluation scripts, but it doesn’t include...
Because of network problems with wandb, I would like to switch to [swanlab](https://docs.swanlab.cn/guide_cloud/integration/integration-huggingface-transformers.html) for logging experiment results. Is there a way to do this without modifying the FlagEmbedding source code?
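One possible approach, sketched below rather than an official integration: launch training through a small wrapper script that patches `transformers.Trainer` to attach swanlab's callback, so FlagEmbedding itself stays untouched. The `SwanLabCallback` import path is an assumption taken from the linked swanlab docs and may differ by version.

```python
# launch_with_swanlab.py -- a sketch, assuming swanlab ships a transformers-compatible
# callback as described in its docs (import path below is an assumption).
import runpy

from transformers import Trainer
from swanlab.integration.transformers import SwanLabCallback  # assumed path, check swanlab docs

_original_init = Trainer.__init__


def _patched_init(self, *args, **kwargs):
    # Attach the SwanLab callback to every Trainer (and Trainer subclass) that
    # FlagEmbedding constructs, without editing FlagEmbedding source.
    callbacks = list(kwargs.get("callbacks") or [])
    callbacks.append(SwanLabCallback())
    kwargs["callbacks"] = callbacks
    _original_init(self, *args, **kwargs)


Trainer.__init__ = _patched_init

if __name__ == "__main__":
    # Launch as: torchrun --nproc_per_node 2 launch_with_swanlab.py <usual finetune args>
    runpy.run_module(
        "FlagEmbedding.finetune.embedder.encoder_only.m3",
        run_name="__main__",
        alter_sys=True,
    )
```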
Hello. In my tests I found that the relevance score between two completely identical texts is slightly lower than that between texts that are semantically very similar but not identical. For example:

Text A: "资金流出金额位于 (0,25%] 区间的支出总金额"
Score of Text A against itself: 0.99992736
Score of Text A against "资金流出金额位于 (0,25%] 区间的支出总笔数": 0.99992745 (slightly higher than the former)

This raises a question: during the training of bge-reranker-large, did the dataset not include samples where completely identical texts have a relevance of 1? Could that explain the result above? A screenshot of the test code is attached for reference.
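For reference, a minimal sketch of how this comparison can be reproduced, assuming the scores above come from `FlagReranker.compute_score` with sigmoid normalization (`normalize=True`):

```python
# Sketch: compare reranker scores for an identical pair vs. a near-identical pair.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

text_a = "资金流出金额位于 (0,25%] 区间的支出总金额"
text_b = "资金流出金额位于 (0,25%] 区间的支出总笔数"

scores = reranker.compute_score([[text_a, text_a], [text_a, text_b]], normalize=True)
print("A vs A:", scores[0])
print("A vs B:", scores[1])
```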
I was looking to use Landmark Embedding, but I cannot find the model in the repository. Is this repo updated with the LMK model as well?
I noticed that the code includes both encoder-only and decoder-only approaches. Are there any concrete numbers I can refer to on how the two differ in retrieval performance?
https://huggingface.co/datasets/JUNJIE99/VISTA_S2

When I process the data in this link, I use the following commands:

```
cat images.tar.part* > images.tar
tar -xvf images.tar
```

I successfully ran the first line of...