text-generation-inference
Async and sync requests result in different generations
System Info
2023-06-15T04:27:53.010592Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: 5ce89059f8149eaf313c63e9ded4199670cd74bb
Docker label: sha-5ce8905
nvidia-smi:
Thu Jun 15 04:27:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07 Driver Version: 515.65.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:00.0 Off | Off |
| N/A 34C P0 86W / 400W | 25302MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:16:00.0 Off | Off |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:49:00.0 Off | Off |
| N/A 31C P0 73W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4D:00.0 Off | Off |
| N/A 31C P0 71W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C5:00.0 Off | Off |
| N/A 34C P0 91W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | Off |
| N/A 34C P0 92W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:E3:00.0 Off | Off |
| N/A 33C P0 96W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E7:00.0 Off | Off |
| N/A 35C P0 89W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- I tested this with both LLaMa 7B and 65B (with max_concurrent_request=128). I tried both a single A100 (80GB) setup and sharding across 8 A100s.
- When I send 50 or more async requests to the server, I can see that the generation results change slightly even when I pass the following parameters: top_k=1&do_sample=false
- Here's an example:
=> You can see that for the same input, "What is Deep Learning?", there are two or more different generations even though I turned off all the randomness.
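Roughly what my comparison looks like (a minimal sketch, assuming the text_generation Python client and a local endpoint at http://localhost:8080; the request count and max_new_tokens are illustrative):

```python
import asyncio
from text_generation import AsyncClient, Client

URL = "http://localhost:8080"   # assumed local TGI endpoint
PROMPT = "What is Deep Learning?"
PARAMS = dict(max_new_tokens=64, do_sample=False, top_k=1)

# Synchronous baseline: one request at a time is stable.
sync_text = Client(URL).generate(PROMPT, **PARAMS).generated_text

# Async: 50+ concurrent requests with the same prompt and parameters.
async def run_concurrent(n: int = 50) -> set[str]:
    client = AsyncClient(URL)
    responses = await asyncio.gather(
        *[client.generate(PROMPT, **PARAMS) for _ in range(n)]
    )
    return {r.generated_text for r in responses}

distinct = asyncio.run(run_concurrent())
print("sync:", sync_text)
print(len(distinct), "distinct async generations")  # expected 1, but I observe >1
```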
Expected behavior
Async and sync requests should produce the same generation results for the same prompt and parameters.
The funny thing is, when I send the same number of requests synchronously, the generations are stable.
You can also see from the above image that the model even degenerates sometimes. This behavior happened when I overloaded the model with 100 async requests from two different user endpoints.
Basically, the model gets worse the more requests I send simultaneously.
I'm guessing this has something to do with the continuous batching feature?
Async and sync use the exact same functions in the backend. However, matrix multiplication kernels in mixed precision are not deterministic and can lead to differences in generations when the batch size increases. do_sample=False does not do anything when top_k is set; sampling will be activated anyway. top_k=1 might be the reason for the weird behaviour.
Can you try to reproduce the error without top_k, just using greedy decoding? My bet is that the multinomial is doing something weird.
See: https://github.com/pytorch/pytorch/issues/48841 https://github.com/huggingface/transformers/issues/22979
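For anyone who wants to poke at the batch-size effect outside of TGI, here is a hedged toy sketch (plain PyTorch, not the actual server code path): projecting the same hidden state alone vs. inside a larger batch can yield slightly different reduced-precision logits, because different batch sizes can dispatch different GEMM kernels with different reduction orders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
torch.manual_seed(0)

hidden = torch.randn(1, 4096, device=device, dtype=dtype)       # one request's last hidden state
others = torch.randn(63, 4096, device=device, dtype=dtype)      # other requests batched with it
lm_head = torch.randn(4096, 32000, device=device, dtype=dtype)  # vocab projection

# Same row, computed alone vs. as part of a 64-row batch.
logits_alone = hidden @ lm_head
logits_batched = (torch.cat([hidden, others]) @ lm_head)[:1]

# The two results are not guaranteed to be bit-identical in reduced precision:
# different batch sizes can select different kernels / reduction orders
# (the difference may well be zero on some backends).
print("max abs diff:", (logits_alone - logits_batched).abs().max().item())
print("argmax unchanged:", torch.equal(logits_alone.argmax(-1), logits_batched.argmax(-1)))
```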
OK, when I tried this with a custom-built kernel, the generation seems to be stable (even with 128 async requests).
I couldn't reproduce the error. However, I tried this with a custom-built kernel (following the local installation steps), so it's not exactly the same as the above environment. Let me try reproducing it with the original environment to see if it's any different.
Here's one thing I can confirm though: the generations with top_k=1 and top_k=None are definitely different.
@jshin49 yes, top_k=1 is not equivalent to greedy. Top-k will sample from tokens with scores >= the kth highest score. This means that it could be choosing from more than k tokens if there is a tie for kth place, and in particular when k=1 it will sample randomly from all the tokens that are tied with the highest score. Greedy uses argmax, which will deterministically choose the token that has the highest score and the lowest id.
Intuitively to me at least, this makes k=1 sampling with a fixed random seed preferable to greedy, since with greedy you can end up with an unintended bias towards tokens with lower ids.
Note that with 16 bits, such score collisions are quite common, especially with the larger vocab sizes.
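A toy sketch of that distinction (illustrative logits, not TGI's actual warper code): with a tie at the top, top_k=1 filtering keeps every tied token and multinomial picks among them, while argmax is deterministic.

```python
import torch

# Toy fp16 logits with a tie for the highest score (token ids 3 and 7).
logits = torch.full((11,), -1.0, dtype=torch.float16)
logits[3] = 5.0
logits[7] = 5.0

# top_k=1 keeps every token whose score >= the 1st-highest score,
# so both tied tokens survive the filter.
kth = torch.topk(logits, k=1).values[-1]
filtered = logits.masked_fill(logits < kth, float("-inf"))
probs = torch.softmax(filtered.float(), dim=-1)

torch.manual_seed(0)
picked = {torch.multinomial(probs, num_samples=1).item() for _ in range(20)}
print("top_k=1 sampling picked:", picked)          # typically {3, 7}

# Greedy decoding: argmax deterministically returns one token
# (the lowest tied id, 3, in this toy example).
print("greedy argmax picks:", logits.argmax().item())
```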
Any update on this?
> Note that with 16 bits, such score collisions are quite common, especially with the larger vocab sizes.
For the score collisions with 16 bits, could you please give some examples or relevant references? @njhill
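A toy illustration of how such collisions arise from fp16 rounding (assumed example values): float16 has only ~11 bits of mantissa, so logits that are distinct in float32 often round to exactly the same half-precision value.

```python
import torch

# Two logits that are distinct in float32...
a = torch.tensor(10.0002, dtype=torch.float32)
b = torch.tensor(10.0005, dtype=torch.float32)
print(a == b)                 # tensor(False)

# ...land on the same float16 value: near 10.0 the fp16 spacing is ~0.0078,
# so both round to exactly 10.0 and a "tie" appears.
print(a.half() == b.half())   # tensor(True)
print(a.half().item())        # 10.0

# Over a ~32k-token vocab, top scores falling into the same fp16 bucket
# like this is what makes the collisions described above fairly common.
```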
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.