
About efficiency of the model

Open AOZMH opened this issue 3 years ago • 21 comments

Hi,

Thanks for the great repo, I enjoy exploring it a lot! However, when I tried to run the code from the "Use BLINK in your codebase" section of the README, I found the model relatively slow (in fast=False mode). To be more specific, when I execute "main_dense.run", the first stage of processing runs quite slowly (~2.5 seconds per item) while the later stage (the one that prints "Evaluation") handles ~5 items per second. I also tried adding an index as below.

config = {
    ...
    "faiss_index": "flat",
    "index_path": models_path+"faiss_flat_index.pkl"
}

However, the first stage became even slower (~20 seconds per item). I'm wondering whether I'm configuring something wrong (especially the faiss index) that resulted in the low speed. Are there any corrections or methods to speed things up? Thanks for your help! (I'll post the performance logs below if needed.)

AOZMH avatar Apr 11 '21 05:04 AOZMH

Hi @AOZMH, Thanks for reporting. It is known that the (fast=False) mode is slow, however, the faiss index should be faster. Could you provide a code snippet? Thanks.

ledw avatar Apr 12 '21 16:04 ledw

Thanks for the follow-up, I'll provide my code snippet and the execution logs (time costs) shortly after!

AOZMH avatar Apr 13 '21 02:04 AOZMH

Using Biencoder+Crossencoder is much slower than using the Biencoder only.

wutaiqiang avatar May 06 '21 09:05 wutaiqiang

Since the transformer is O(n^2) in sequence length, we can infer that T(cross-encoder) : T(bi-encoder) = (n+n)^2 : 2*n^2 = 4n^2 : 2n^2 = 2 : 1 (the cross-encoder processes the concatenated mention and candidate of length 2n, while the bi-encoder processes the two length-n sequences separately).

wutaiqiang avatar May 06 '21 09:05 wutaiqiang

Roughly, then, T(cross+bi) : T(bi) = 3 : 1, ignoring the other processing.

wutaiqiang avatar May 06 '21 09:05 wutaiqiang

Roughly, then, T(cross+bi) : T(bi) = 3 : 1, ignoring the other processing.

  • Thanks for the theoretical analysis! That's a bit different from the case I encountered, in which the bi-encoder phase was quite slow and adding an index turned out to be even slower.
  • Actually, when I read through the code, I found that the bi-encoder phase runs entirely on CPU (including the transformer encoding of the sentence and the matrix multiplication that computes a similarity score for each entity), which makes it slow compared with the cross-encoder, which runs entirely on GPU.
  • I also manually changed the code to run the bi-encoder phase on GPU, which resulted in a GPU out-of-memory error due to the LARGE multiplication (hidden_dim * num_entities). As far as I can tell, this requires at least 24GB of GPU memory at fp32, which exceeds the limit of my 16GB P100.
  • To work around that, I tried switching all calculations to fp16, which avoided the OOM but produced a warning that large fp16 matmuls had bugs (see this). The fp16 bi-encoder results were also completely wrong, e.g. a few incorrect entities were given high scores.
  • Finally, I manually pruned the entity set to fit in 16GB of memory at fp32 and everything was fine; the times for the bi-encoder and cross-encoder came out to ~20ms and ~100ms, which is reasonable for me.

To wrap up, I conclude that:

  1. I guess it was the high memory cost that led the contributors to run the bi-encoder phase on CPU (which uses RAM instead of GPU memory), but that significantly hurts performance, since, as we all know, transformers are way slower on CPU than on GPU.
  2. To run BLINK smoothly on a single GPU, I think we need more than 24GB of GPU memory, which I guess is not scarce at FAIR, but it does pose some difficulty for students like me :)
  3. What I still haven't figured out is the slow execution with the faiss index, which was even slower than the pure-CPU bi-encoder. Maybe someone can offer some additional comments?

Thanks for all the help! I'll be happy to follow any updates.

AOZMH avatar May 06 '21 13:05 AOZMH

Hi @AOZMH, Thanks for reporting. It is known that the (fast=False) mode is slow, however, the faiss index should be faster. Could you provide a code snippet? Thanks.

My code snippet was the same as the one in the README, as shown below.

import blink.main_dense as main_dense
import argparse

models_path = "models/" # the path where you stored the BLINK models

config = {
    "test_entities": None,
    "test_mentions": None,
    "interactive": False,
    "top_k": 10,
    "biencoder_model": models_path+"biencoder_wiki_large.bin",
    "biencoder_config": models_path+"biencoder_wiki_large.json",
    "entity_catalogue": models_path+"entity.jsonl",
    "entity_encoding": models_path+"all_entities_large.t7",
    "crossencoder_model": models_path+"crossencoder_wiki_large.bin",
    "crossencoder_config": models_path+"crossencoder_wiki_large.json",
    "fast": False, # set this to be true if speed is a concern
    "output_path": "logs/" # logging directory
}

args = argparse.Namespace(**config)

models = main_dense.load_models(args, logger=None)

data_to_link = [ {
                    "id": 0,
                    "label": "unknown",
                    "label_id": -1,
                    "context_left": "".lower(),
                    "mention": "Shakespeare".lower(),
                    "context_right": "'s account of the Roman general Julius Caesar's murder by his friend Brutus is a meditation on duty.".lower(),
                },
                {
                    "id": 1,
                    "label": "unknown",
                    "label_id": -1,
                    "context_left": "Shakespeare's account of the Roman general".lower(),
                    "mention": "Julius Caesar".lower(),
                    "context_right": "'s murder by his friend Brutus is a meditation on duty.".lower(),
                }
                ]

_, _, _, _, _, predictions, scores, = main_dense.run(args, None, *models, test_data=data_to_link)

Also, I changed the config as below to add the faiss index.

config = {
    "test_entities": None,
    "test_mentions": None,
    "interactive": False,
    "top_k": 10,
    "biencoder_model": models_path+"biencoder_wiki_large.bin",
    "biencoder_config": models_path+"biencoder_wiki_large.json",
    "entity_catalogue": models_path+"entity.jsonl",
    "entity_encoding": models_path+"all_entities_large.t7",
    "crossencoder_model": models_path+"crossencoder_wiki_large.bin",
    "crossencoder_config": models_path+"crossencoder_wiki_large.json",
    "fast": False, # set this to be true if speed is a concern
    "output_path": "logs/", # logging directory
    "faiss_index": "flat",
    "index_path": models_path+"faiss_flat_index.pkl"
}

AOZMH avatar May 06 '21 13:05 AOZMH

Hey, how did you manually change biencoder to gpu? Could you share the snippets?

shahjui2000 avatar Jun 02 '21 09:06 shahjui2000

Hey, how did you manually change biencoder to gpu? Could you share the snippets?

You can simply restore this commented-out code to re-enable the transfer to GPU (# .to(device) => .to(device)) and manually move the corresponding model input tensors to GPU, and that should work.

It may take me a few days to clean up my (experimental) code, so maybe you can give it a try with the ideas above; as far as I can recall, it needs fewer than 20 lines of changes. Anyway, if you still have any problems please feel free to reply and I'll try to share my code snippet.
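
For reference, a minimal sketch of the idea (variable and function names are hypothetical, not the repo's exact internals): load the entity matrix onto the GPU once, then do the dot-product scoring there.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The (num_entities, hidden_dim) entity matrix that load_models reads from entity_encoding;
# at fp32 it needs roughly 24GB of GPU memory for the full Wikipedia entity set.
candidate_encoding = torch.load("models/all_entities_large.t7").to(device)

def score_candidates(context_encodings):
    # context_encodings: (batch_size, hidden_dim) mention vectors from the bi-encoder on GPU;
    # the dense retrieval score is the dot product between mention and entity vectors.
    return context_encodings.to(device) @ candidate_encoding.T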

AOZMH avatar Jun 02 '21 09:06 AOZMH

I am facing the same memory issue you did. Could you elaborate on your third point - how to lower memory usage?

shahjui2000 avatar Jun 02 '21 17:06 shahjui2000

I am facing the same memory issue you did. Could you elaborate on your third point - how to lower memory usage?

I tried two ways as follow:

  1. Prune the candidate entity set. This code refers to the vector representations of all entities in Wikipedia (of shape <num_candidates, hidden_dim>). That matrix is loaded by the load_models function used in the README, so you can arbitrarily prune it down to a smaller set of entities, which eliminates the OOM. However, as you can see, this is a purely experimental approach just to test the running speed, and it may lose recall on entity linking (since some entities are excluded from the candidate set).
  2. Execute in FP16. You can convert all the corresponding tensors (e.g. candidate_encodings, model weights, input tensors, etc.) to half-precision (16-bit) floats with a = a.half(). However, as in this discussion, matrix multiplication on very large fp16 tensors may produce garbled results, which is exactly what happened for me: the entity scores given by the bi-encoder were completely wrong. Maybe a later CUDA version solves this (mine is a relatively outdated CUDA 10.0); fingers crossed for your new findings! A rough sketch of both workarounds follows below.
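
A rough, experimental sketch of both workarounds (variable names hypothetical; the entity matrix is the one loaded by load_models):

import torch

candidate_encoding = torch.load("models/all_entities_large.t7")  # (num_entities, hidden_dim)

# 1. Prune the candidate set so it fits in GPU memory. Trades recall for memory:
#    pruned entities can no longer be linked.
pruned_encoding = candidate_encoding[:1_000_000].to("cuda")       # e.g. keep only the first 1M entities

# 2. Or run everything in half precision. Avoids the OOM, but as noted above the large
#    fp16 matmul gave garbled scores on CUDA 10.0, so validate the outputs carefully.
half_encoding = candidate_encoding.half().to("cuda")
# biencoder = biencoder.half()   # model weights and input tensors must be converted too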

AOZMH avatar Jun 03 '21 04:06 AOZMH

BTW, I really hope the developers of BLINK can look into this issue to solve the faiss index problem I mentioned before, thanks in advance!

AOZMH avatar Jun 03 '21 04:06 AOZMH

I wonder if this is also a feasible solution - splitting the candidate_encoding when you pass it to the GPU, then concatenating the split scores and continuing with the code? That way the memory passed to the GPU in each call is reduced without removing entities.

shahjui2000 avatar Jun 03 '21 14:06 shahjui2000

I wonder if this is also a feasible solution - splitting the candidate_encoding when you pass it to the GPU, then concatenating the split scores and continuing with the code? That way the memory passed to the GPU in each call is reduced without removing entities.

That should work, but the time cost would be considerable. The basic assumption is that the whole candidate_encoding cannot fit into GPU memory, so if you split it into A and B, you still cannot put both on the GPU at the same time. Thus, for each execution (instead of once per model initialization), we need to transfer A to the GPU, run on A, delete A from GPU memory, then transfer B, run on B, and delete B. That is, the splitting approach requires a transfer between main memory and GPU memory for every EXECUTION, which would be costly.

However, it would be a good idea if you have multiple GPUs, e.g. putting A and B PERMANENTLY on GPUs 0 and 1, so that no per-execution transfer is needed.
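
For concreteness, a sketch of that chunked version (names hypothetical): it streams the candidate matrix through the GPU in slices and concatenates the partial score vectors, so it never runs out of memory, but it pays a host-to-GPU copy of the full matrix on every call.

import torch

def chunked_scores(context_enc, candidate_encoding, chunk_size=500_000, device="cuda"):
    # context_enc: (batch, hidden_dim) on the GPU; candidate_encoding: (num_entities, hidden_dim) on CPU
    parts = []
    for start in range(0, candidate_encoding.size(0), chunk_size):
        chunk = candidate_encoding[start:start + chunk_size].to(device)
        parts.append(context_enc @ chunk.T)    # partial scores for this slice of entities
        del chunk                              # release GPU memory before the next slice
    return torch.cat(parts, dim=1)             # (batch, num_entities)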

AOZMH avatar Jun 03 '21 15:06 AOZMH

I think a possible solution is to encode all the queries and all the candidates on GPU and save them, then build a faiss index on CPU to find the nearest entities. Faiss is much more efficient and the results are satisfactory. But this takes considerable effort and means the pipeline has to be reconstructed.

BTW, if you use one 32GB V100, the problem will not occur.
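
A sketch of that decoupled pipeline (assuming the candidate encodings are the all_entities_large.t7 tensor from the README setup, and query_vecs holds mention encodings already produced on GPU and moved back to CPU as numpy):

import faiss
import torch

# Build a CPU index once over the pre-computed entity encodings.
cand = torch.load("models/all_entities_large.t7").numpy().astype("float32")  # (num_entities, dim)
index = faiss.IndexFlatIP(cand.shape[1])    # inner product matches BLINK's dot-product scoring
index.add(cand)

# query_vecs: (batch, dim) float32 numpy array of mention encodings
scores, ids = index.search(query_vecs, 10)  # top-10 nearest entities per mention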

Jun-jie-Huang avatar Jun 18 '21 13:06 Jun-jie-Huang

Hey, how did you manually change biencoder to gpu? Could you share the snippets?

You can simply restore this commented-out code to re-enable the transfer to GPU (# .to(device) => .to(device)) and manually move the corresponding model input tensors to GPU, and that should work.

It may take me a few days to clean up my (experimental) code, so maybe you can give it a try with the ideas above; as far as I can recall, it needs fewer than 20 lines of changes. Anyway, if you still have any problems please feel free to reply and I'll try to share my code snippet.

Hi @AOZMH, could you please share your snippets of using gpu for biencoder? I did that but the speed is still slow, maybe I did wrongly...

shixiao9941 avatar Jun 25 '21 17:06 shixiao9941

Hey, same for me too! Changing it to run on GPU was actually slower than on CPU.

shahjui2000 avatar Jun 26 '21 07:06 shahjui2000

Btw patches #89 and #90 might help. To enable the GPU you can try editing your biencoder_wiki_large.json and crossencoder_wiki_large.json files to set no_cuda to false.
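
If you'd rather flip the flag with a small script than edit the files by hand, something along these lines should work (paths taken from the configs earlier in the thread):

import json

for fname in ["models/biencoder_wiki_large.json", "models/crossencoder_wiki_large.json"]:
    with open(fname) as f:
        cfg = json.load(f)
    cfg["no_cuda"] = False  # run this model on GPU
    with open(fname, "w") as f:
        json.dump(cfg, f, indent=2)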

tomtung avatar Jul 23 '21 00:07 tomtung

Hi @AOZMH, can you please share how you converted BLINK to work in FP16? I am getting errors. Thanks

rajharshiitb avatar Dec 25 '21 12:12 rajharshiitb

I met the same problem: when I add the faiss index path, it becomes slower.

amelieyu1989 avatar Oct 27 '22 03:10 amelieyu1989

Hi,

You may want to make some changes to the codebase to add support for compressed (approximate) indexes. Currently, the BLINK codebase only supports flat indices.

I am currently using a quantized index, OPQ32_768,IVF4096,PQ32x8, built on the candidate encodings, and the speed improvement is significant.

For example, this is what my faiss_indexer.py looks like.

This is how I load the models.

config = {
    "interactive": False,
    "fast": False,
    "top_k": 8,
    "biencoder_model": models_path + "biencoder_wiki_large.bin",
    "biencoder_config": models_path + "biencoder_wiki_large.json",
    "crossencoder_model": models_path + "crossencoder_wiki_large.bin",
    "crossencoder_config": models_path + "crossencoder_wiki_large.json",
    "entity_catalogue": models_path + "entities_aliases_with_ids.jsonl",
    "entity_encoding": models_path + "all_entities_aliases.t7",
    "faiss_index": "OPQ32_768,IVF4096,PQ32x8",
    "index_path": models_path + "index_opq32_768_ivf4096_pq32x8.faiss",
    "output_path": "logs/",  # logging directory
}

self.args = argparse.Namespace(**config)

logger.info("Loading BLINK model...")
self.models = main_dense.load_models(self.args, logger=logger)
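
Not the actual faiss_indexer.py changes, but a sketch of how such an index can be built with faiss's index_factory and written to the index_path used above:

import faiss
import torch

cand = torch.load(models_path + "all_entities_aliases.t7").numpy().astype("float32")
index = faiss.index_factory(cand.shape[1], "OPQ32_768,IVF4096,PQ32x8", faiss.METRIC_INNER_PRODUCT)
index.train(cand)   # IVF/PQ indexes need a training pass over the vectors
index.add(cand)
faiss.write_index(index, models_path + "index_opq32_768_ivf4096_pq32x8.faiss")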

abhinavkulkarni avatar Oct 27 '22 05:10 abhinavkulkarni