Jiaxin Shan

Results 271 issues of Jiaxin Shan

The community is asking each working group (WG) to own its infra and will no longer provide a common shared testing infra. See kubeflow/testing#752 for more details. The PyTorch migration works well and here's...

kind/feature

We see two identical log lines for the same object. One actually comes from pod creation and the other from services. ``` INFO[0006] Update on create function xgboostjob-operator create...

kind/bug
area/front-end

The error `there is an error for the input config` comes from https://github.com/kubeflow/xgboost-operator/blob/8a87df2ae33aa8a6b7939384bf600f6bf4d01321/pkg/controller/xgboostjob/xgboostjob_controller.go#L95. That code declares a new `kcfg`, so the value is never assigned to the outer variable. ``` $ ./bin/manager {"level":"info","ts":1589504573.6995661,"logger":"entrypoint","msg":"setting up...

kind/bug
area/operator
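The shadowing pattern described in this issue can be sketched in Go as follows. This is a minimal illustration, not the operator's actual code: `Config`, `getConfig`, `shadowed`, and `fixed` are hypothetical names, and only the error message and the `kcfg` variable name come from the issue text.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the real client config type.
type Config struct {
	Host string
}

// getConfig is a hypothetical loader used only for this illustration.
func getConfig() (*Config, error) {
	return &Config{Host: "https://example.invalid"}, nil
}

// shadowed reproduces the bug pattern: ":=" inside the block declares a
// new kcfg scoped to that block, so the outer kcfg stays nil.
func shadowed(useLoader bool) (*Config, error) {
	var kcfg *Config
	if useLoader {
		kcfg, err := getConfig() // new, block-scoped kcfg
		if err != nil {
			return nil, err
		}
		_ = kcfg // the loaded value is dropped when the block ends
	}
	if kcfg == nil {
		return nil, errors.New("there is an error for the input config")
	}
	return kcfg, nil
}

// fixed declares err up front and uses "=" so the outer kcfg is assigned.
func fixed(useLoader bool) (*Config, error) {
	var kcfg *Config
	var err error
	if useLoader {
		kcfg, err = getConfig() // assigns to the outer kcfg
		if err != nil {
			return nil, err
		}
	}
	if kcfg == nil {
		return nil, errors.New("there is an error for the input config")
	}
	return kcfg, nil
}

func main() {
	_, err := shadowed(true)
	fmt.Println("shadowed:", err)

	cfg, err := fixed(true)
	fmt.Println("fixed:", cfg, err)
}
```

In this sketch the only difference between the two functions is `:=` versus `=` inside the `if` block, which is enough to decide whether the outer variable ever receives the loaded value.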

``` /bin/bash -c /mnt/test-data-volume/kubeflow-xgboost-operator-presubmit-build-70-9aa764f-7905-1ef1/src/kubeflow/xgboost-operator//build_image.sh /mnt/test-data-volume/kubeflow-xgboost-operator-presubmit-build-70-9aa764f-7905-1ef1/src/kubeflow/xgboost-operator/Dockerfile gcr.io/kubeflow-ci/xgboost-operator v1.0 ``` The build_image.sh script consumes only two arguments. https://github.com/kubeflow/xgboost-operator/blob/78f8cf50bb943247e038a8feb5a9f7e47d810d65/build_image.sh#L10-L12 If we pass an extra argument, `v1.0` will be assigned which may...

kind/bug

The E2E test is down. The reason is straightforward: the server reports a 503 error. I did some checking and noticed this is already tracked in the torch community. As the patch is...

Docs changes should not trigger presubmit jobs. This helps improve development efficiency and reduces testing infra cost.

area/engprod
kind/feature
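One way to picture the check behind this request is a small filter over the changed file list. The sketch below is only an assumption about how such a filter could look: the rule that "docs" means Markdown files or anything under a `docs/` directory, and the names `isDocFile` and `docsOnly`, are hypothetical and not taken from the issue or from any CI system.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isDocFile reports whether a repo-relative path counts as documentation.
// Assumption: docs are Markdown files or anything under docs/.
func isDocFile(p string) bool {
	p = path.Clean(p)
	if strings.HasPrefix(p, "docs/") {
		return true
	}
	return strings.ToLower(path.Ext(p)) == ".md"
}

// docsOnly reports whether every changed file is documentation.
// An empty change list returns false so that jobs still run by default.
func docsOnly(changed []string) bool {
	if len(changed) == 0 {
		return false
	}
	for _, p := range changed {
		if !isDocFile(p) {
			return false
		}
	}
	return true
}

func main() {
	changed := []string{"README.md", "docs/design.md"}
	if docsOnly(changed) {
		fmt.Println("docs-only change: presubmit could be skipped")
	} else {
		fmt.Println("code change: run presubmit")
	}
}
```

Returning false for an empty list is a deliberately conservative choice here, so a missing or unreadable diff never causes tests to be skipped.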

[TorchElastic](https://pytorch.org/elastic/) enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner. Use cases: - Fault Tolerance: jobs that run on infrastructure where nodes get replaced...

kind/feature

kubeflow/common has released a stable version, 0.3.1, and we can migrate to its implementation. The change will be similar to https://github.com/kubeflow/tf-operator/pull/1171. It would be better to resolve dependencies,...

kind/feature

TensorFlow and PyTorch use branches rather than tags for dependency management. Since we may make some breaking changes in the repo, I would suggest cutting a release and tags...

kind/feature

### The Feature I am using LiteLLM to proxy requests for different providers. ### Motivation, pitch I am using Volcano Engine internally https://www.volcengine.com/docs/82379/1133189#python and also OpenAI-compatible services, I do...

enhancement