Nicolas Patry
Did you properly set `--shm-size 1g`?
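For reference, a minimal launch sketch with the shared-memory size raised; the image tag, port, volume path, and model id below are placeholders, not values taken from this issue:

```shell
# Sketch of a TGI launch with 1 GB of shared memory for NCCL; adjust the
# image tag, model id, and volume path to your setup.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id tiiuae/falcon-7b
```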
Not really; 60s for cross-GPU communication is already a lot. Allowing for a longer timeout will not help here, since the cards simply cannot communicate.
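One way to sanity-check whether the GPUs can reach each other at all, using standard NVIDIA/NCCL tooling rather than anything TGI-specific:

```shell
# Print the GPU interconnect topology; if the cards only reach each other via
# SYS (host bridges) or not at all, NCCL collectives can hang like this.
nvidia-smi topo -m

# Passing -e NCCL_DEBUG=INFO to the container (a standard NCCL variable) also
# makes the communicator setup log where it stalls.
```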
Created a PR for it.
Maybe take example from other models we have implemented in `server/text-generation-server/models/custom_modeling/*.py`? There are also some files in `server/text-generation-server/models/*.py`. Those declare the model as being flash-enabled (the batching happens...
It's supported on a "best effort" basis. I started some work to actually support it, but it means rewriting flash attention (the CUDA version) with added bias, which may take...
> on implementing dynamic batching for this as it only supports 1 concurrent request for now on AutoModel.

This won't require extra work once we have flash attention.
Because it doesn't implement the flash attention we want. This is Triton's flash attention, which doesn't support "unpadded" batching, which is necessary to work nicely with TGI (removing...
Here is the non-flash version (as a temporary measure, since modifying the kernel is taking more time than I anticipated): https://github.com/huggingface/text-generation-inference/pull/514. This should enable sharding at least.
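For anyone wanting to try the sharded non-flash path once that PR lands, a rough sketch of a multi-GPU launch; the shard count and model id are only examples, not values confirmed in this thread:

```shell
# Hypothetical sharded launch across 2 GPUs using the non-flash code path;
# --num-shard controls how many GPUs the weights are split over.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id tiiuae/falcon-40b \
    --num-shard 2
```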
https://github.com/huggingface/text-generation-inference/pull/514 should make TRUST_REMOTE_CODE no longer necessary.
I will close this issue since it seems to have been solved. For `tiiuae/falcon-rw-1b`, feel free to open an issue with the env and stacktrace so we can look into fixing...