Wu Yi

Results 53 issues of Wu Yi

Paddle v1 pserver uses `OMPI_COMM_WORLD_SIZE` for cost averaging (see `paddle/pserver/ParameterServer2.cpp:101`), so paddle cloud should set this environment variable for v1 jobs to work.
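
A minimal sketch of how a job launcher could set this variable before starting a v1 process. The helper name `pserver_env` and the `world_size` parameter are hypothetical; only the variable name `OMPI_COMM_WORLD_SIZE` comes from the issue, and the issue does not say which value paddle cloud should use.

```python
import os


def pserver_env(world_size, base_env=None):
    """Return a copy of the environment with OMPI_COMM_WORLD_SIZE set.

    `world_size` is an assumption: the issue only says the variable
    must be present for v1 jobs, not how its value is derived.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_COMM_WORLD_SIZE"] = str(world_size)
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.Popen` when launching the process.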

Once we have TPR/CRD declared resource:

```yaml
apiVersion: paddlepaddle.org/v1
kind: TrainingJob
metadata:
  name: job-1
spec:
  image: "paddlepaddle/paddlecloud-job"
  trainer:
    entrypoint: "python train.py"
    workspace: "/home/job-1/"
    min-instance: 3
    max-instance: 6
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu:...
```

need be discussed

The "publish" command doesn't check whether the file exists in the cloud, so files published by mistake cannot be downloaded.
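
One possible shape of the missing check, as a sketch only. The function `publish` and its `exists` and `do_publish` callbacks are hypothetical stand-ins for whatever storage lookup and publish call the real command uses.

```python
def publish(path, exists, do_publish):
    """Publish `path` only if `exists(path)` confirms it is in cloud storage.

    `exists` and `do_publish` are hypothetical callbacks supplied by the caller.
    """
    if not exists(path):
        # Fail early instead of creating a link that cannot be downloaded.
        raise FileNotFoundError(f"cannot publish {path!r}: file not found in the cloud")
    return do_publish(path)
```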

Add the basic feature design for the paddle cloud web pages.

feature request

Can run `paddlecloud serve -model-path xxx -scale 100 -cpu 1 -memory 8Gi -entry "infer.py"` to start a serverless URL endpoint for serving the model.

feature request

1. Configure the submit backend in `~/.paddle/config`, e.g. kubernetes, docker-compose, etc.
1. The user runs `paddlecloud submit` to submit a training job.
1. Paddle cloud server will dispatch the job to corresponding...

As in https://github.com/PaddlePaddle/cloud/issues/297, we have many environment variables defined for jobs to use internally; these need to be documented.

Support [gpu-manager](https://github.com/tkestack/gpu-manager) vGPU resources for Jupyter notebooks. vGPU support for Kubeflow Pipelines will be added later.

size/L
lifecycle/stale
ok-to-test

Add support for configuring arbitrary GPU resource types, such as those provided by https://github.com/4paradigm/k8s-vgpu-scheduler, https://github.com/tkestack/gpu-manager, etc. Taking https://github.com/tkestack/gpu-manager as an example, after installing `gpu-manager` the cluster should provide virtual GPU resource types `tencent.com/vcuda-core`...

size/L
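
To illustrate the configurable GPU resource type described above, here is a minimal sketch that builds a container `resources` fragment from a resource name chosen by the user. The helper `gpu_limits` is hypothetical and not part of any project named here; `tencent.com/vcuda-core` is the resource name quoted in the issue.

```python
def gpu_limits(resource_name, count):
    """Build a Kubernetes container `resources` fragment for one GPU resource type.

    Hypothetical helper: the resource name is passed in rather than
    hard-coded, so any device plugin's resource type can be requested.
    """
    if "/" not in resource_name:
        # Extended resources use fully qualified names such as "vendor.com/name".
        raise ValueError("resource name must look like 'domain/name'")
    if count < 0:
        raise ValueError("count must not be negative")
    return {"limits": {resource_name: str(count)}}
```

For example, `gpu_limits("tencent.com/vcuda-core", 50)` yields a fragment that can be merged into a container spec before the pod is created.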