text-generation-inference
Async and sync requests result in different generations
System Info
2023-06-15T04:27:53.010592Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: 5ce89059f8149eaf313c63e9ded4199670cd74bb
Docker label: sha-5ce8905
nvidia-smi:
Thu Jun 15 04:27:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07 Driver Version: 515.65.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:00.0 Off | Off |
| N/A 34C P0 86W / 400W | 25302MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:16:00.0 Off | Off |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:49:00.0 Off | Off |
| N/A 31C P0 73W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4D:00.0 Off | Off |
| N/A 31C P0 71W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C5:00.0 Off | Off |
| N/A 34C P0 91W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | Off |
| N/A 34C P0 92W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:E3:00.0 Off | Off |
| N/A 33C P0 96W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E7:00.0 Off | Off |
| N/A 35C P0 89W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- I tested this with both LLaMa 7B and 65B (with max_concurrent_request=128). I tried both a single A100 (80GB) setup and sharding across 8 A100s.
- When I send 50 or more async requests to the server, I can see that the generation results change slightly even when I pass the following parameters: top_k=1&do_sample=false
- Here's an example:
=> You can see that for the same input, "What is Deep Learning?", there are two or more different generations even though I turned off all the randomness.
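Roughly what my comparison looks like (a minimal sketch, assuming the text_generation Python client and a local endpoint at http://localhost:8080; the request count and max_new_tokens are illustrative):

```python
import asyncio
from text_generation import AsyncClient, Client

URL = "http://localhost:8080"   # assumed local TGI endpoint
PROMPT = "What is Deep Learning?"
PARAMS = dict(max_new_tokens=64, do_sample=False, top_k=1)

# Synchronous baseline: one request at a time is stable.
sync_text = Client(URL).generate(PROMPT, **PARAMS).generated_text

# Async: 50+ concurrent requests with the same prompt and parameters.
async def run_concurrent(n: int = 50) -> set[str]:
    client = AsyncClient(URL)
    responses = await asyncio.gather(
        *[client.generate(PROMPT, **PARAMS) for _ in range(n)]
    )
    return {r.generated_text for r in responses}

distinct = asyncio.run(run_concurrent())
print("sync:", sync_text)
print(len(distinct), "distinct async generations")  # expected 1, but I observe >1
```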
Expected behavior
Async and sync requests should produce the same generation results for the same prompt and parameters.
The funny thing is, when I send the same number of requests synchronously, the generations are stable.
You can also see from the above image that the model even degenerates sometimes. This behavior happened when I overloaded the model with 100 async requests from two different user endpoints.
Basically, the model gets worse the more requests I send simultaneously.
I'm guessing this has something to do with the continuous batching feature?
Async and sync use the exact same functions in the backend. However, matrix multiplication kernels in mixed precision are not deterministic and can lead to differences in generations when the batch size increases. do_sample=False does not do anything when top_k is set; sampling will be activated anyway. top_k=1 might be the reason for the weird behaviour.
Can you try to reproduce the error without top_k, just using greedy decoding? My bet is that the multinomial is doing something weird.
See: https://github.com/pytorch/pytorch/issues/48841 https://github.com/huggingface/transformers/issues/22979
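For anyone who wants to poke at the batch-size effect outside of TGI, here is a hedged toy sketch (plain PyTorch, not the actual server code path): projecting the same hidden state alone vs. inside a larger batch can yield slightly different reduced-precision logits, because different batch sizes can dispatch different GEMM kernels with different reduction orders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
torch.manual_seed(0)

hidden = torch.randn(1, 4096, device=device, dtype=dtype)       # one request's last hidden state
others = torch.randn(63, 4096, device=device, dtype=dtype)      # other requests batched with it
lm_head = torch.randn(4096, 32000, device=device, dtype=dtype)  # vocab projection

# Same row, computed alone vs. as part of a 64-row batch.
logits_alone = hidden @ lm_head
logits_batched = (torch.cat([hidden, others]) @ lm_head)[:1]

# The two results are not guaranteed to be bit-identical in reduced precision:
# different batch sizes can select different kernels / reduction orders
# (the difference may well be zero on some backends).
print("max abs diff:", (logits_alone - logits_batched).abs().max().item())
print("argmax unchanged:", torch.equal(logits_alone.argmax(-1), logits_batched.argmax(-1)))
```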
OK, when I tried this with a custom-built kernel, the generation seems to be stable (even with 128 async requests).
I couldn't reproduce the error. However, I tried this with a custom-built kernel (following the local installation steps), so it's not exactly the same as the above environment. Let me try reproducing it with the original environment to see if it's any different.
Here's one thing I can confirm though: the generations with top_k=1 and top_k=None are definitely different.
@jshin49 yes, top_k=1 is not equivalent to greedy. Top-k will sample from tokens with scores >= the kth highest score. This means that it could be choosing from more than k tokens if there is a tie for kth place, and in particular when k=1 it will sample randomly from all the tokens that are tied with the highest score. Greedy uses argmax, which will deterministically choose the token that has the highest score and the lowest id.
Intuitively to me at least, this makes k=1 sampling with a fixed random seed preferable to greedy, since with greedy you can end up with an unintended bias towards tokens with lower ids.
Note that with 16 bits, such score collisions are quite common, especially with the larger vocab sizes.
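A toy sketch of that distinction (illustrative logits, not TGI's actual warper code): with a tie at the top, top_k=1 filtering keeps every tied token and multinomial picks among them, while argmax is deterministic.

```python
import torch

# Toy fp16 logits with a tie for the highest score (token ids 3 and 7).
logits = torch.full((11,), -1.0, dtype=torch.float16)
logits[3] = 5.0
logits[7] = 5.0

# top_k=1 keeps every token whose score >= the 1st-highest score,
# so both tied tokens survive the filter.
kth = torch.topk(logits, k=1).values[-1]
filtered = logits.masked_fill(logits < kth, float("-inf"))
probs = torch.softmax(filtered.float(), dim=-1)

torch.manual_seed(0)
picked = {torch.multinomial(probs, num_samples=1).item() for _ in range(20)}
print("top_k=1 sampling picked:", picked)          # typically {3, 7}

# Greedy decoding: argmax deterministically returns one token
# (the lowest tied id, 3, in this toy example).
print("greedy argmax picks:", logits.argmax().item())
```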
Any update on this?
> Note that with 16 bits, such score collisions are quite common, especially with the larger vocab sizes.
For the score collisions with 16 bits, could you please give some examples or relevant references? @njhill
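A toy illustration of how such collisions arise from fp16 rounding (assumed example values): float16 has only ~11 bits of mantissa, so logits that are distinct in float32 often round to exactly the same half-precision value.

```python
import torch

# Two logits that are distinct in float32...
a = torch.tensor(10.0002, dtype=torch.float32)
b = torch.tensor(10.0005, dtype=torch.float32)
print(a == b)                 # tensor(False)

# ...land on the same float16 value: near 10.0 the fp16 spacing is ~0.0078,
# so both round to exactly 10.0 and a "tie" appears.
print(a.half() == b.half())   # tensor(True)
print(a.half().item())        # 10.0

# Over a ~32k-token vocab, top scores falling into the same fp16 bucket
# like this is what makes the collisions described above fairly common.
```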
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.