Update operators to keep tensors on the GPU between ops (where possible)
- [ ] Create a numpy/cupy dispatch mechanism (like pandas/cudf in NVT)
- [ ] Use DLPack to pass GPU tensors from the Python backend to other models
- [ ] Update FilterCandidates
- [ ] Update SoftmaxSampling
- [ ] Update Faiss and Feast ops to convert to GPU?
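A numpy/cupy dispatch mechanism along the lines of the first checkbox might look like the following minimal sketch. This mirrors the pandas/cudf pattern used in NVTabular; the helper names (`array_module`, `softmax`) are illustrative, not existing Merlin APIs:

```python
import numpy as np

# cupy is optional: fall back to numpy-only behavior on CPU machines
try:
    import cupy as cp
    HAS_CUPY = True
except ImportError:
    cp = None
    HAS_CUPY = False


def array_module(tensor):
    """Return the array library (numpy or cupy) that owns `tensor`."""
    if HAS_CUPY and isinstance(tensor, cp.ndarray):
        return cp
    return np


def softmax(scores):
    """Compute a softmax with whichever backend `scores` lives on,
    so GPU tensors never round-trip through host memory."""
    xp = array_module(scores)
    exp = xp.exp(scores - scores.max())  # shift for numerical stability
    return exp / exp.sum()
```

An operator like SoftmaxSampling could then call `softmax` on whatever array it receives and stay on-device for free.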
Depending on how this turns out, we may or may not find it worthwhile to add a graph optimizer that condenses multiple operators into a single TritonPythonModel. That would still help us avoid the scheduling overhead of passing requests between models, but it might not be a big boost if combining operators no longer helps us avoid GPU-CPU round-trip conversions.
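For reference, the DLPack hand-off mentioned above uses the standard `__dlpack__` protocol, which both numpy (1.22+) and cupy implement. Numpy stands in for the GPU library in this sketch so it runs anywhere; with cupy arrays the same call moves tensors between frameworks without a host round trip:

```python
import numpy as np

# DLPack hand-off sketch: export an array via the DLPack protocol and
# re-import it zero-copy. With cupy, `cp.from_dlpack(src)` works the
# same way on device memory.
src = np.arange(4, dtype=np.float32)
dst = np.from_dlpack(src)  # zero-copy: shares src's underlying buffer

assert np.shares_memory(src, dst)  # same memory, no copy occurred
```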