Shuai Xie
### Feature Area /area backend Currently, I can use the `kfp` Python SDK to delete runs with the following code. ```py from kfp_server_api.api.run_service_api import RunServiceApi run_api = RunServiceApi() def list_all_runs(sort_by='name'): list_runs_rsp...
### Is there an existing issue / discussion for this? - [X] I have searched the existing issues / discussions ### Is this problem answered in the FAQ? | Is there an...
Hi, everyone. I want to test the fault tolerance of PytorchJob. I started a PytorchJob with 1 master and 3 workers. ```shell $ kubectl get pods -o wide NAME READY...
I am trying to figure out why Bare Metal (BM) and PytorchJob (PJ) have different training results in https://github.com/kubeflow/pytorch-operator/issues/354#issue-999999536. I have now found that PytorchJob v1.8.0 and 1.9.0 have different training...
Dear developers, I have run into a new problem. I compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results. ## Experiment settings - Two...
Hello. Dear developers, I found a problem when using PytorchJob. ## Problem I noticed that **PytorchJob replica pods don't obey the scheduling rules set in the node affinity. All the...
Dear Developers: I'm deploying a GPT model with triton-inference-server and fastertransformer_backend, following this tutorial: https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/gpt_guide.md#run-triton-server-on-multiple-nodes. I have successfully implemented the single-node deployment and conducted identity testing. However, as I moved...
### Environment * Kubeflow Pipelines Standalone on a local cluster. * KFP version: 1.7.0 * KFP SDK version: 1.8.2 ### Steps to reproduce Dear developers, I try to run the...