Tianyi Chen
### Description #### Source code compilation and installation - [x] streaming c++ - [x] streaming python - [x] streaming java - [x] mobius python - [ ] mobius java ####...
Fix Java test case for native issues.
### What changes were proposed in this pull request? 1. Implement ```CheckTrainingHangOperator``` based on XPU Timer metric. 2. Integrate context from ```JobManager``` in ```DiagnosisDataManager```. 3. Use limited ```Deque``` instead of...
### What changes were proposed in this pull request? 1. Add the 'succeeded report' implementation. 2. Add a 'succeeded' flag to the ```Node``` object. 3. Skip 'succeeded' nodes in the 'noheartbeat' judgement. ### Why...
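A minimal sketch of the idea behind skipping succeeded nodes in a no-heartbeat check. This is not the project's actual code: the `Node` dataclass, its `heartbeat_time` field, and the `find_no_heartbeat_nodes` helper are hypothetical names introduced only for illustration.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for the real Node object (assumption)."""
    node_id: int
    heartbeat_time: float      # timestamp of the last heartbeat, in seconds
    succeeded: bool = False    # set once the node has reported success


def find_no_heartbeat_nodes(
    nodes: List[Node], timeout_secs: float, now: Optional[float] = None
) -> List[Node]:
    """Return nodes whose last heartbeat is older than timeout_secs.

    Nodes flagged as succeeded are skipped: they have exited normally,
    so a missing heartbeat is expected and should not count as a failure.
    """
    now = time.time() if now is None else now
    return [
        node
        for node in nodes
        if not node.succeeded and now - node.heartbeat_time > timeout_secs
    ]
```

With a 30-second timeout evaluated at `now=100.0`, a node that last reported at `0.0` is returned only if its `succeeded` flag is unset.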
# Background Currently, DLRover uses the official Kubernetes Python client to interact with the Kubernetes API Server. This part of the implementation is quite important because it involves managing the lifecycle...
### What changes were proposed in this pull request? 1. A POC framework definition, along with a basic RL solution implementation. This includes the entire process from RL job submission...
**Is your feature request related to a problem? Please describe.** It is now recommended to use the checkpoint implementation from the latest version of Megatron directly. DLRover's integration with...
**Is your feature request related to a problem? Please describe.** Considering that the use of gRPC involves significant dependency issues, and there is a degree of uncertainty when using gRPC in...
### What changes were proposed in this pull request? 1. Complete the documentation for the new architecture. 2. Update the homepage. 3. Update release-related information. ### Why are the changes...