I want to deploy three models: a large language model occupying one GPU, plus an embedding model and a re-ranking model sharing the other GPU. How can I do this?
There are two GPU devices on the Kubernetes node and timeSlicing.replicas is set to 2, so the node advertises four nvidia.com/gpu resources. The large language model pod requests nvidia.com/gpu: 2 and the other two models request nvidia.com/gpu: 1 each, but the large language model pod ends up with two physical GPU devices instead of two time-sliced replicas of the same GPU.
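For context, the setup described above would use a time-slicing config along these lines (a minimal sketch following the k8s-device-plugin's `sharing.timeSlicing` schema; each physical GPU is advertised as two `nvidia.com/gpu` replicas):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        # Each physical GPU is exposed as 2 schedulable replicas,
        # so a 2-GPU node advertises nvidia.com/gpu: 4 in total.
        replicas: 2
```

Note that the replicas produced this way are indistinguishable to the scheduler, which is why a request for 2 can be satisfied by replicas from two different physical GPUs.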
You can't do this with the standard device plugin. You will need to wait until DRA is available: https://docs.google.com/document/d/1BNWqgx_SmZDi-va_V31v3DnuVwYnF2EmN7D-O_fB6Oo/edit#heading=h.bxuci8gx6hna
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
DRA cannot be used in production environments yet. Could we modify the GPU allocation strategy in the device plugin's code to implement this instead? Can anyone give me some help?
Example: using time slicing with replicas set to 2, the pod with limits set to 2 should be assigned both replicas of one GPU, and the other two pods, with limits set to 1 each, should both be assigned to the same (other) GPU.
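The intended layout can be sketched as pod resource requests like the following (the pod and container names are placeholders; with plain time slicing the scheduler cannot actually guarantee this placement, which is the problem being discussed):

```yaml
# LLM pod: wants both time-sliced replicas of a single physical GPU.
apiVersion: v1
kind: Pod
metadata:
  name: llm
spec:
  containers:
    - name: llm
      image: my-llm-image          # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 2        # intended: 2 replicas of ONE GPU
---
# Embedding and re-ranking pods: one replica each, intended to
# land on the same (other) physical GPU.
apiVersion: v1
kind: Pod
metadata:
  name: embedding
spec:
  containers:
    - name: embedding
      image: my-embedding-image    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: reranker
spec:
  containers:
    - name: reranker
      image: my-reranker-image     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Because time-sliced replicas carry no identity the scheduler can use, there is no supported field here that pins a request to a specific physical device; that is what DRA is meant to address.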