Simon_CQK
In our production environment, we found that some scenarios rely on scheduling job replicas in stages; otherwise severe exceptions occur. For example: 1. for...
## Topic Description `elastic training` has become a popular approach in distributed model training; both job completion time and average cluster utilization benefit from it. We'd like to explore a native elastic training solution on...
## Topic Description For now, `istio` handles traffic distribution when inference serves multiple model versions with different traffic weights; however, this is more like using a sledgehammer to crack a...
**What would you like to be added**: Renaming between CacheBackend and Dataset. CacheBackend is coupled to caching-system semantics, whereas Dataset only describes a set of data at a higher level of abstraction, cache...
# KubeDL 2022 Annual Review ## Table of Contents ## Background KubeDL is a suite of Kubernetes controllers that enable running machine learning workloads on Kubernetes, such as model training...
**What would you like to be added**: For now, we use `kaniko` to build, push, and deliver model files in the Model/ModelVersion lifecycle, which is a user-friendly approach; however, `buildkit` (officially operated by Docker)...
**What would you like to be added**: 1. Format and optimize the code structure to pass `go lint` inspections. 2. Add a `go lint` check as a required step when running the Makefile....
**What would you like to be added**: 1. Collect anomalous pod states and events, and progressively discover abnormal nodes. 2. Avoid scheduling pods on abnormal nodes. **Why is this needed**:...
### Search before asking - [X] I searched the [issues](https://github.com/ray-project/kuberay/issues) and found no similar issues. ### KubeRay Component ray-operator ### What happened + What you expected to happen I launched...
### Ⅰ. Describe what this PR does fix #3368 ### Ⅱ. Does this pull request fix one issue? fixes #XXXX ### Ⅲ. List the added test cases (unit test/integration test)...