
Issue using adapter with large prompt + sharded

Open · tgaddair opened this issue · 5 comments

The following error occurred at request time:

CUDA error: an illegal memory access was encountered

Repro context:

  • Mixtral-8x7b
  • Adapter (rank 8)
  • Long prompt
  • Sharded (2+ GPUs)

Ideally, we'd like a simple repro script using a public HF model and adapter that we can use to diagnose further.
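For reference, a minimal sketch of the kind of request that triggers it (hypothetical adapter ID and placeholder prompt, assuming the standard /generate endpoint):

# sketch: long prompt + adapter against a sharded deployment; adapter ID and prompt are placeholders
curl http://127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "<several-thousand-token prompt>", "parameters": {"adapter_id": "some-org/some-rank8-adapter", "max_new_tokens": 64}}'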

cc @noyoshi

tgaddair · Feb 26 '24

Exact same model, same setup.

The model is sharded (2 A100s) and served with 2 adapters. An initial call without an adapter is processed successfully, but after 1-2 calls the server fails with the aforementioned error.

I haven't tried on a single GPU.

lighteternal · Feb 29 '24

Thanks for the additional info @lighteternal. I'll be curious to see whether #303 can help with this issue by reducing the KV cache size in exchange for setting aside more memory for adapters and their overhead. I'll keep trying to find a good repro on my side this week.

tgaddair · Mar 04 '24

Thanks for the support! The latest Docker image has solved the CUDA errors in my case. However, I notice the following strange behaviour regardless of the adapter-memory-fraction value:

When I run my deployment via the following command on 2 shards using 2 A100s, it works, but there's a significant delay during JIT adapter swapping. This delay (usually 3-5 seconds) is longer than the streaming time itself.

!sudo docker run --gpus 'all' --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard $num_shard --sharded true --max-total-tokens 32000 --max-input-length 31999 --max-batch-prefill-tokens 31999 --adapter-memory-fraction 0.1 --quantize eetq

By contrast, if I host Mixtral on a single A100 (1 shard) via the following command, I get some initial latency during adapter swapping for the first 3-4 queries, but then it goes away, as if the adapters are successfully cached.

!sudo docker run --gpus '"device=0"' --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard $num_shard --sharded false --max-total-tokens 32000 --max-input-length 31999 --max-batch-prefill-tokens 31999 --quantize eetq

For the former approach with 2 GPUs, I tried adapter-memory-fraction values of 0.1, 0.2, and 0.3 (beyond that I get a CUDA OOM error).

Am I misusing the adapter-memory-fraction parameter, or missing something entirely? I would expect more GPUs to allow for more comfortable caching. For now, the latter approach works well, but I am concerned about scaling as demand grows.
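For reference, a minimal sketch of the call pattern that exhibits the swap delay (hypothetical adapter IDs, assuming the standard /generate endpoint):

# alternate between two adapters and time each request to surface the swap overhead; adapter IDs are placeholders
for adapter in my-org/adapter-a my-org/adapter-b; do
    time curl -s http://127.0.0.1:8080/generate \
        -X POST \
        -H 'Content-Type: application/json' \
        -d "{\"inputs\": \"test prompt\", \"parameters\": {\"adapter_id\": \"$adapter\", \"max_new_tokens\": 32}}" > /dev/null
done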

Thank you for this amazing framework. 🙏

lighteternal · Mar 10 '24

Hey @lighteternal, glad the adapter memory fraction helped!

Do you happen to know if the 2x A100s are connected over NVLink? If not, the slowness may be attributed to network latency between the devices, as multi-LoRA inference requires a good amount of cross-device communication when sharded across multiple GPUs.
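For reference, the GPU interconnect topology can be checked with:

# prints NV# between GPU pairs when NVLink is present, PHB/SYS when traffic crosses PCIe or the host bridge
nvidia-smi topo -m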

For your tests, how many adapters are you running with, and what are their ranks? In most cases I wouldn't expect adapters to need offloading unless you're working with very large ranks or a great many adapters at once.
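For context, adapter memory scales roughly with rank times the summed projection dimensions. A rough back-of-envelope sketch (assuming fp16 weights and LoRA applied only to the q/k/v/o attention projections of Mixtral-8x7B; actual target modules depend on the adapter config):

# per-module LoRA params = rank * (d_in + d_out); Mixtral-8x7B: hidden=4096, kv dim=1024 (GQA), 32 layers
rank=32; layers=32
per_layer=$(( rank * (4096 + 4096) + rank * (4096 + 1024) + rank * (4096 + 1024) + rank * (4096 + 4096) ))
echo "~$(( per_layer * layers * 2 / 1024 / 1024 )) MiB per adapter in fp16"

That works out to a few tens of MiB per rank-32 adapter, which is small relative to the adapter memory pool on an 80GB A100.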

tgaddair · Mar 10 '24

@tgaddair It is using NVLink (NC A100 v4-series instance). 2 adapters, rank 32. They need to be called sequentially as part of a pipeline, so any latency in swapping directly impacts overall latency.

Admittedly, I didn't have this problem with CodeLlama adapters that also took advantage of both GPUs; however, they might have been a lower rank.
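For reference, the rank of a PEFT-style adapter can be read from its adapter_config.json (assuming a standard PEFT export):

# the LoRA rank is stored under the "r" key of the adapter config
grep '"r"' adapter_config.json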

lighteternal · Mar 10 '24