Jiaxin Shan
The community is asking each WG to own its infra, and the community will no longer provide a common shared testing infra. See kubeflow/testing#752 for more details. The PyTorch migration works well and here's...
We see two identical log lines for the same object. One actually comes from pod creation and the other from service creation. ``` INFO[0006] Update on create function xgboostjob-operator create...
The error `there is an error for the input config` comes from https://github.com/kubeflow/xgboost-operator/blob/8a87df2ae33aa8a6b7939384bf600f6bf4d01321/pkg/controller/xgboostjob/xgboostjob_controller.go#L95. It declares a new `kcfg`, so the value is never assigned to the variable in the outer scope. ``` $ ./bin/manager {"level":"info","ts":1589504573.6995661,"logger":"entrypoint","msg":"setting up...
``` /bin/bash -c /mnt/test-data-volume/kubeflow-xgboost-operator-presubmit-build-70-9aa764f-7905-1ef1/src/kubeflow/xgboost-operator//build_image.sh /mnt/test-data-volume/kubeflow-xgboost-operator-presubmit-build-70-9aa764f-7905-1ef1/src/kubeflow/xgboost-operator/Dockerfile gcr.io/kubeflow-ci/xgboost-operator v1.0 ``` The `build_image.sh` script only consumes two arguments. https://github.com/kubeflow/xgboost-operator/blob/78f8cf50bb943247e038a8feb5a9f7e47d810d65/build_image.sh#L10-L12 If we add an extra argument, `v1.0` will be assigned which may...
The E2E test is down. The reason is straightforward: the server reports a 503 error. I did some checking and noticed this is already tracked in the torch community. As the patch is...
Docs-only changes should not trigger presubmit jobs. This helps improve development efficiency and reduces testing infra cost.
[TorchElastic](https://pytorch.org/elastic/) enables distributed PyTorch training jobs to run in a fault-tolerant and elastic manner. Use cases: - Fault Tolerance: jobs that run on infrastructure where nodes get replaced...
kubeflow/common has released a stable version, 0.3.1, so we can migrate to the kubeflow/common implementation. The change will be similar to https://github.com/kubeflow/tf-operator/pull/1171. It would be better to resolve dependencies,...
TensorFlow and PyTorch use branches rather than tags for dependency management. Since we may make some breaking changes in the repo, I would suggest cutting a release and tags...
### The Feature I am using LiteLLM to proxy requests to different providers. ### Motivation, pitch I am using Volcano Engine internally (https://www.volcengine.com/docs/82379/1133189#python) and also OpenAI-compatible services, I do...