Wu Yi issues

Results: 53 issues of Wu Yi

paddle cloud should set OMPI_COMM_WORLD_SIZE env for pserver

The Paddle v1 pserver uses `OMPI_COMM_WORLD_SIZE` for cost averaging (see `paddle/pserver/ParameterServer2.cpp:101`), so paddle cloud should set this environment variable for v1 jobs to work.
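A minimal sketch of how a job launcher might inject this variable into the pserver environment. The helper name `pserver_env` and the choice of the trainer count as the value are assumptions made for illustration; they are not taken from paddle cloud's code.

```python
import os


def pserver_env(trainer_count, base_env=None):
    """Return a copy of the environment with OMPI_COMM_WORLD_SIZE set.

    Assumption: the value the v1 pserver expects is the number of
    trainer processes in the job.
    """
    if trainer_count < 1:
        raise ValueError("trainer_count must be at least 1")
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["OMPI_COMM_WORLD_SIZE"] = str(trainer_count)
    return env
```

The returned mapping could then be passed to whatever starts the pserver process or container.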

Do we need paddlectl client once we have the kubernetes custom controller?

Once we have a TPR/CRD-declared resource:

```yaml
apiVersion: paddlepaddle.org/v1
kind: TrainingJob
metadata:
  name: job-1
spec:
  image: "paddlepaddle/paddlecloud-job"
  trainer:
    entrypoint: "python train.py"
    workspace: "/home/job-1/"
    min-instance: 3
    max-instance: 6
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu:...
```

need be discussed

"publish" command doesn't check whether the file exists in the cloud

The "publish" command doesn't check whether the file exists in the cloud, so published files may fail to download.

Web design

Add the basic feature design of the web pages for paddle cloud.

feature request

[Feature] Add serve command to start a serverless predict serving port

Users can run `paddlecloud serve -model-path xxx -scale 100 -cpu 1 -memory 8Gi -entry "infer.py"` to start a serverless URL endpoint that serves the model.

feature request

Able to submit to any environments

1. Configure the submit backend in `~/.paddle/config`, like kubernetes, docker-compose, etc.
1. The user runs `paddlecloud submit` to submit a training job.
1. Paddle cloud server will dispatch the job to corresponding...
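The dispatch step described above could be sketched as a lookup table keyed by backend name. Everything here is hypothetical: the `backend` config key, the handler functions, and their return values are illustrative stand-ins, not part of paddle cloud.

```python
def submit_to_kubernetes(job_name):
    # Placeholder: a real handler would create the job on a cluster.
    return f"kubernetes:{job_name}"


def submit_to_docker_compose(job_name):
    # Placeholder: a real handler would start the job locally.
    return f"docker-compose:{job_name}"


# Hypothetical registry mapping a configured backend name to its handler.
BACKENDS = {
    "kubernetes": submit_to_kubernetes,
    "docker-compose": submit_to_docker_compose,
}


def dispatch(config, job_name):
    """Send the job to the backend named in the (hypothetical) config dict."""
    backend = config.get("backend")
    if backend not in BACKENDS:
        raise ValueError(f"unknown submit backend: {backend!r}")
    return BACKENDS[backend](job_name)
```

A table like this keeps the submit command unchanged when a new backend is added; only the registry grows.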

Document about paddle cloud environment variables

As in https://github.com/PaddlePaddle/cloud/issues/297, we have many environment variables defined for jobs to use internally; these need to be documented.

Record pfs upload and download requests for audit

Support gpu-manager vGPU resource for notebook

Support [gpu-manager](https://github.com/tkestack/gpu-manager) vGPU resource for jupyter notebook. Will add vGPU support for Kubeflow Pipeline later.

size/L

lifecycle/stale

ok-to-test

feat: add `memoryVendors` and arbitrary counts for notebook GPU resource limits

Add support for configuring arbitrary GPU resource types, such as https://github.com/4paradigm/k8s-vgpu-scheduler and https://github.com/tkestack/gpu-manager. Taking https://github.com/tkestack/gpu-manager as an example: after installing `gpu-manager`, the cluster should provide virtual GPU resource types `tencent.com/vcuda-core`...
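As a rough illustration of what "arbitrary GPU resource types and counts" could mean for a notebook's resource limits, the sketch below builds a Kubernetes-style `limits` mapping from vendor-qualified resource names. The function name `gpu_resource_limits` and the counts used are hypothetical; this is not the implementation from the pull request.

```python
def gpu_resource_limits(requests):
    """Build a Kubernetes-style `limits` mapping from {resource_name: count}.

    Resource names are passed through untouched, so any vendor's
    extended resource can be used.
    """
    limits = {}
    for name, count in requests.items():
        # Extended resources are vendor-qualified, e.g. "vendor.com/resource".
        if "/" not in name:
            raise ValueError(f"expected a vendor-qualified name, got {name!r}")
        if not isinstance(count, int) or count < 1:
            raise ValueError(f"count for {name!r} must be a positive integer")
        # Quantities are emitted as strings, as is common in pod specs.
        limits[name] = str(count)
    return {"limits": limits}
```

For example, `gpu_resource_limits({"tencent.com/vcuda-core": 50})` yields a `limits` entry for that resource with an illustrative count of 50.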

size/L