Attention not working properly in FlashRobertaModel and FlashBertModel
System Info
Operating System
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Hardware used
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   23C    P8             36W /  300W |       1MiB /  23028MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   18C    P8             15W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Set the model: `model=BAAI/bge-base-en-v1.5`
- Set the volume: `volume=$PWD/data`
- Run lorax:
  `docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:bd92e52 --model-id $model --max-input-length=512`
- Run the first example:
  `curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: who has the most instagram followers on instagram"}' -H 'Content-Type: application/json'`
- Run the second example:
  `curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: how many episodes in a season of stranger things"}' -H 'Content-Type: application/json'`
- Run the same queries through Hugging Face directly, following the instructions at https://huggingface.co/BAAI/bge-base-en-v1.5 (a comparison sketch follows this list)
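For reference, a minimal comparison sketch in Python. Two assumptions here: the lorax server started above is listening on localhost:8080, and the `/embed` response carries the vector under an `embeddings` field (adjust the indexing if the schema differs); the Hugging Face side follows the CLS-pooling recipe from the bge-base-en-v1.5 model card.

```python
# Sketch: compare the lorax /embed output against a local Hugging Face
# forward pass (CLS pooling + L2 normalization, per the model card).
import requests
import torch
from transformers import AutoModel, AutoTokenizer

query = ("Represent this sentence for searching relevant passages: "
         "who has the most instagram followers on instagram")

# 1) Embedding from the lorax server
resp = requests.post("http://localhost:8080/embed", json={"inputs": query})
# NOTE: the exact /embed response schema is an assumption; adjust as needed.
lorax_emb = torch.tensor(resp.json()["embeddings"])

# 2) Embedding computed locally with transformers
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model.eval()
with torch.no_grad():
    batch = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
    hf_emb = model(**batch)[0][:, 0]                      # CLS token
    hf_emb = torch.nn.functional.normalize(hf_emb, p=2, dim=1).squeeze(0)

# Should be ~1.0 if the two implementations agree; it is not.
print(torch.nn.functional.cosine_similarity(lorax_emb, hf_emb, dim=0))
```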
The same applies to the reranker model https://huggingface.co/BAAI/bge-reranker-v2-m3.
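For the reranker case, the local reference scores can be computed with the sequence-classification recipe from the bge-reranker-v2-m3 model card (the query/passage pair below is just an illustration):

```python
# Sketch: local reference scores for bge-reranker-v2-m3, per its model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
model.eval()

pairs = [
    ["what is panda?", "The giant panda is a bear species endemic to China."],
]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       return_tensors="pt", max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)  # compare against the scores lorax returns for the same pairs
```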
Expected behavior
The output embeddings of the two queries are exactly identical to each other, and both differ substantially from the embeddings I get when running the same model through Hugging Face directly. The same applies to BAAI/bge-reranker-v2-m3, which is a RobertaModel, so the BERT and RoBERTa models seem to share the same issue.
I did line-by-line debugging of your implementation, running the server locally and comparing each layer's output against the official Hugging Face implementation. The attention output of every layer is completely different from the attention computed by the Hugging Face code, so I suspect the issue is here: https://github.com/predibase/lorax/blob/c0e5798318a6b826572c612ddd4cf44621aa4add/server/lorax_server/models/custom_modeling/flash_bert_modeling.py#L165
and here: https://github.com/predibase/lorax/blob/c0e5798318a6b826572c612ddd4cf44621aa4add/server/lorax_server/models/custom_modeling/flash_roberta_modeling.py#L98
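For anyone reproducing the layer-by-layer comparison on the Hugging Face side, here is a sketch using forward hooks (the hook names and the printed slice are illustrative, not the exact debugging script I used):

```python
# Sketch: capture per-layer attention outputs from the Hugging Face model
# to compare against the corresponding layers in flash_bert_modeling.py.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # BertSelfOutput returns the post-projection attention hidden states
        captured[name] = output.detach()
    return hook

for i, layer in enumerate(model.encoder.layer):
    layer.attention.output.register_forward_hook(make_hook(f"layer{i}.attention"))

batch = tokenizer("Represent this sentence for searching relevant passages: "
                  "who has the most instagram followers on instagram",
                  return_tensors="pt")
with torch.no_grad():
    model(**batch)

for name, tensor in captured.items():
    print(name, tensor.shape, tensor[0, 0, :4])  # compare with lorax's values
```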