Attention not working properly in FlashRobertaModel and FlashBertModel
System Info
Operating System
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Hardware used
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   23C    P8             36W /  300W |       1MiB /  23028MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   18C    P8             15W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8             16W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Set the model: `model=BAAI/bge-base-en-v1.5`
- Set the volume: `volume=$PWD/data`
- Run lorax:
  `docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:bd92e52 --model-id $model --max-input-length=512`
- Run the first example:
  `curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: who has the most instagram followers on instagram"}' -H 'Content-Type: application/json'`
- Run the second example:
  `curl localhost:8080/embed -X POST -d '{"inputs": "Represent this sentence for searching relevant passages: how many episodes in a season of stranger things"}' -H 'Content-Type: application/json'`
- Run the same queries through Hugging Face directly, following the instructions at https://huggingface.co/BAAI/bge-base-en-v1.5 (a comparison sketch follows this list)
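For reference, a minimal comparison sketch in Python. Two assumptions here: the lorax server started above is listening on localhost:8080, and the `/embed` response carries the vector under an `embeddings` field (adjust the indexing if the schema differs); the Hugging Face side follows the CLS-pooling recipe from the bge-base-en-v1.5 model card.

```python
# Sketch: compare the lorax /embed output against a local Hugging Face
# forward pass (CLS pooling + L2 normalization, per the model card).
import requests
import torch
from transformers import AutoModel, AutoTokenizer

query = ("Represent this sentence for searching relevant passages: "
         "who has the most instagram followers on instagram")

# 1) Embedding from the lorax server
resp = requests.post("http://localhost:8080/embed", json={"inputs": query})
# NOTE: the exact /embed response schema is an assumption; adjust as needed.
lorax_emb = torch.tensor(resp.json()["embeddings"])

# 2) Embedding computed locally with transformers
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model.eval()
with torch.no_grad():
    batch = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
    hf_emb = model(**batch)[0][:, 0]                      # CLS token
    hf_emb = torch.nn.functional.normalize(hf_emb, p=2, dim=1).squeeze(0)

# Should be ~1.0 if the two implementations agree; it is not.
print(torch.nn.functional.cosine_similarity(lorax_emb, hf_emb, dim=0))
```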
The same applies to the reranker model https://huggingface.co/BAAI/bge-reranker-v2-m3.
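For the reranker case, the local reference scores can be computed with the sequence-classification recipe from the bge-reranker-v2-m3 model card (the query/passage pair below is just an illustration):

```python
# Sketch: local reference scores for bge-reranker-v2-m3, per its model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
model.eval()

pairs = [
    ["what is panda?", "The giant panda is a bear species endemic to China."],
]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       return_tensors="pt", max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)  # compare against the scores lorax returns for the same pairs
```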
Expected behavior
The output embeddings of the two queries are exactly identical to each other, and both differ substantially from the embeddings I get when running the same model through Hugging Face directly. The same applies to BAAI/bge-reranker-v2-m3, which is a RobertaModel, so the BERT and RoBERTa models seem to share the same issue.
I did line-by-line debugging of your implementation, running the server locally and comparing each layer's output against the official Hugging Face implementation. The attention output of every layer is completely different from the attention computed by the Hugging Face code, so I suspect the issue is here: https://github.com/predibase/lorax/blob/c0e5798318a6b826572c612ddd4cf44621aa4add/server/lorax_server/models/custom_modeling/flash_bert_modeling.py#L165
and here: https://github.com/predibase/lorax/blob/c0e5798318a6b826572c612ddd4cf44621aa4add/server/lorax_server/models/custom_modeling/flash_roberta_modeling.py#L98
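For anyone reproducing the layer-by-layer comparison on the Hugging Face side, here is a sketch using forward hooks (the hook names and the printed slice are illustrative, not the exact debugging script I used):

```python
# Sketch: capture per-layer attention outputs from the Hugging Face model
# to compare against the corresponding layers in flash_bert_modeling.py.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # BertSelfOutput returns the post-projection attention hidden states
        captured[name] = output.detach()
    return hook

for i, layer in enumerate(model.encoder.layer):
    layer.attention.output.register_forward_hook(make_hook(f"layer{i}.attention"))

batch = tokenizer("Represent this sentence for searching relevant passages: "
                  "who has the most instagram followers on instagram",
                  return_tensors="pt")
with torch.no_grad():
    model(**batch)

for name, tensor in captured.items():
    print(name, tensor.shape, tensor[0, 0, :4])  # compare with lorax's values
```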