Simon_CQK issues

Results: 22 issues of Simon_CQK

[M5/Feature Request] Orchestrating Job Roles in DAG Scheduling Scheme.

In our production environment, we found that some scenarios rely on scheduling job replicas in stages; otherwise severe exceptions occur. For example: 1. for...

[summer of code] elastic training on kubedl

## Topic Description `elastic training` has become a new trend in distributed model training; both job completion time and average cluster utilization benefit from it. We'd explore a native elastic training solution on...

[summer of code] light-weighted traffic control for inference

## Topic Description For now, `istio` handles traffic distribution when inference serves multiple model versions with different traffic weights; however, it is more like using a sledgehammer to crack a...

[feature request] refactor CacheBackend to Dataset

**What would you like to be added**: rename CacheBackend to Dataset. CacheBackend is coupled to caching-system semantics, whereas Dataset only describes a set of data at a higher level of abstraction, cache...

refactor

[Annual Review] KubeDL 2022 Annual Review

# KubeDL 2022 Annual Review ## Table of Contents ## Background KubeDL is a suite of Kubernetes controllers that enable running machine learning workloads on Kubernetes, such as model training...

tag-runtime

annual review

[feature request] replace kaniko with buildkit to boost model build/push stage

**What would you like to be added**: For now, we use `kaniko` to build, push, and deliver model files in the Model/ModelVersion lifecycle, which is a user-friendly approach; however, `buildkit` (operated by docker official)...

enhancement

[quality improvements] make all `go lint` inspections pass

**What would you like to be added**: 1. format & optimize code structures to pass go lint inspections. 2. add `go lint` check as a required step when running Makefile....

refactor

[feature request] infrastructure anomaly auto detection and avoid to schedule pods on abnormal nodes.

**What would you like to be added**: 1. collect anomalous pod states and events, and discover abnormal nodes progressively. 2. avoid scheduling pods on abnormal nodes. **Why is this needed**:...

[Bug] Job submit failed with actor died

### Search before asking - [X] I searched the [issues](https://github.com/ray-project/kuberay/issues) and found no similar issues. ### KubeRay Component ray-operator ### What happened + What you expected to happen I launched...

bug

feat: re-setup master when master role restarted or recreated and recover dataset

### Ⅰ. Describe what this PR does fix #3368 ### Ⅱ. Does this pull request fix one issue? fixes #XXXX ### Ⅲ. List the added test cases (unit test/integration test)...

needs-ok-to-test