codeflare-sdk
Torchx support for local NVME drives not available (with PV/PVCs for local NVME drives) - feature required for expected performance with multi-node training
I'm running multi-node training of ResNet50 with Torchx, codeflare-sdk, MCAD on OCP 4.12.
I have a 3-node OCP 4.12 cluster; each node has one NVIDIA GPU. Each of the 3 worker nodes has one local 2.9TB NVMe drive and an associated PV and PVC.
[root@e23-h21-740xd ResNet]# oc get pv | grep 2980Gi
local-pv-289604ff   2980Gi   RWO   Delete   Bound   default/dianes-amazing-pvc2   local-sc   18h
local-pv-8006a340   2980Gi   RWO   Delete   Bound   default/dianes-amazing-pvc1   local-sc   4d1h
local-pv-86bac87f   2980Gi   RWO   Delete   Bound   default/dianes-amazing-pvc0   local-sc   21h

NAME                  STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dianes-amazing-pvc0   Bound    local-pv-86bac87f   2980Gi     RWO            local-sc       20h
dianes-amazing-pvc1   Bound    local-pv-8006a340   2980Gi     RWO            local-sc       20h
dianes-amazing-pvc2   Bound    local-pv-289604ff   2980Gi     RWO            local-sc       18h
When I run the following Python script ("python3 python-multi-node-pvc.py"):
[ ResNet]# cat python-multi-node-pvc.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=[['type=volume', 'src=dianes-amazing-pvc0', 'dst="/init"'],
            ['type=volume', 'src=dianes-amazing-pvc1', 'dst="/init"'],
            ['type=volume', 'src=dianes-amazing-pvc2', 'dst="/init"']]
)
job = jobdef.submit()
I get the following error: AttributeError: 'list' object has no attribute 'partition'
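A likely cause (my assumption, based on the mounts being handed through to TorchX): TorchX's mount parser expects a flat list of "key=value" strings and calls str.partition("=") on each entry, so passing a nested list of lists makes it call partition on a list rather than a string. A minimal sketch of the difference:

# Minimal sketch, assuming the SDK forwards `mounts` unchanged to TorchX's parser.
from torchx import specs

# Flat list of "key=value" strings: each element supports str.partition("=")
flat = ["type=volume", "src=dianes-amazing-pvc0", "dst=/init"]
print(specs.parse_mounts(flat))

# Nested lists: the parser calls .partition("=") on a list element and fails with
# AttributeError: 'list' object has no attribute 'partition'
nested = [["type=volume", "src=dianes-amazing-pvc0", "dst=/init"]]
# specs.parse_mounts(nested)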
If I specify only one of the local NVME drives as shown below:
[ ResNet]# cat test_resnet3_single_PVC.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition
arg_list = [ "--train-dir=/init/tiny-imagenet-200/train", "--val-dir=/init/tiny-imagenet-200/val", "--log-dir=/init/tiny-imagenet-200", "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar" ]
jobdef = DDPJobDefinition( name="resnet50", script="pytorch/pytorch_imagenet_resnet50.py", script_args=arg_list, scheduler_args={"namespace": "default"}, j="3x1", gpu=1, cpu=4, memMB=24000, image="quay.io/dfeddema/horovod", mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"'] ) job = jobdef.submit()
Then one pod starts up successfully and is assigned its local volume. The other two pods are not scheduled because "0/3 nodes are available: 1 Insufficient nvidia.com/gpu, 2 node(s) had volume node affinity conflict. preemption 0/3 are available... etc etc"
So, as expected, the 2nd and 3rd pods could not be scheduled because "2 node(s) had volume node affinity conflict".
I need a way to specify the local NVME drive (and associated PVC) for each of the nodes in my cluster. The training data resides on these local NVME drives.
Thanks @dfeddema!
For direct use of the TorchX CLI, I have tested formatting multiple mounts with the argument --mount type=volume,src=foo-pvc,dst="/foo",type=volume,src=bar-pvc,dst="/bar".
@MichaelClifford can you help us understand the correct syntax for multiple mounts through the CodeFlare SDK DDPJobDefinition?
@Sara-KS I also tested the syntax you show above,
--mount type=volume,src=foo-pvc,dst="/data",type=volume,src=bar-pvc,dst="/data"
and it did not work for me.
Here's what I used, which is similar to your example, and it failed (note: if I remove the quotes in the mount args I get errors):
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=volume', 'src=dianes-amazing-pvc0', 'dst="/init"',
            'type=volume', 'src=dianes-amazing-pvc1', 'dst="/init"',
            'type=volume', 'src=dianes-amazing-pvc2', 'dst="/init"']
)
job = jobdef.submit()
@dfeddema That formatting is for direct use of the TorchX CLI and it should include multiple mounts in the generated yaml. I recommend running it as a dryrun and sharing the output here. @MichaelClifford Is the following the appropriate change needed to switch to a dryrun mode with the CodeFlare SDK?
jobdef = DDPJobDefinition(
to
jobdef = DDPJobDefinition._dry_run(
@Sara-KS yes,
DDPJobDefinition._dry_run(cluster) will generate the dry_run output.
There is a parameter in DDPJobDefinition() that allows you to define mounts.
see https://github.com/project-codeflare/codeflare-sdk/blob/baec8585b2bd918becd030951bf43e3504d43ada/src/codeflare_sdk/job/jobs.py#L62C11-L62C11
And the syntax should be similar to how we handle `script_args`. So something like:
mounts = ['type=volume',
'src=foo-pvc',
'dst="/foo"',
'type=volume',
'src=bar-pvc',
'dst="/bar"']
nvm, sorry @dfeddema, I just fully read your last comment and see that you still got errors with that approach.
Is there some way to get the yaml for this? I want to see how codeflare_sdk is specifying the local volumes when I specify them like this in the jobdef:
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
This syntax, DDPJobDefinition._dry_run(cluster), doesn't help me because I'm not using a Ray cluster. When I try to use dry run as shown below, I get the error "TypeError: _dry_run() got an unexpected keyword argument 'name'":
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition
arg_list = [
"--train-dir=/init/tiny-imagenet-200/train",
"--val-dir=/init/tiny-imagenet-200/val",
"--log-dir=/init/tiny-imagenet-200",
"--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]
jobdef = DDPJobDefinition._dry_run(
name="resnet50",
script="pytorch/pytorch_imagenet_resnet50.py",
script_args=arg_list,
scheduler_args={"namespace": "default"},
j="3x1",
gpu=1,
cpu=4,
memMB=24000,
image="quay.io/dfeddema/horovod",
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)
job = jobdef.submit()
Since you are not using a Ray cluster for this, I think you need to do the following to see the dry_run output:
jobdef = DDPJobDefinition(name="resnet50",
script="pytorch/pytorch_imagenet_resnet50.py",
script_args=arg_list,
scheduler_args={"namespace": "default"},
j="3x1",
gpu=1,
cpu=4,
memMB=24000,
image="quay.io/dfeddema/horovod",
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)
jobdef._dry_run_no_cluster()
@MichaelClifford I tried your example above and it didn't produce any output. Maybe I need to import a module that generates this dry run? I see from torchx.specs import AppDryRunInfo, AppDef, but that's not the right module. Is there something similar I'm missing? Is there a DryRun module I need to import?
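One guess, and only an assumption based on the jobs.py source linked above: _dry_run_no_cluster() returns a TorchX AppDryRunInfo object rather than printing anything, so no extra import should be needed; capturing and printing the return value may be enough to see the generated request.

# Assumption: _dry_run_no_cluster() returns a torchx AppDryRunInfo instead of printing it.
dryrun_info = jobdef._dry_run_no_cluster()
print(dryrun_info)  # should display the generated scheduler request, including mounts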
@MichaelClifford @Sara-KS
Thanks Sarah for your suggestion to use type=device. A fully qualified path for the device works.
mounts=['type=device','src=nvme4n1','dst="/init", "perm=rw"']
works with a slight tweak. At first I was getting this error (when I used 'src=nvme4n1'):
Error: failed to mkdir nvme4n1: mkdir nvme4n1: operation not permitted
I tried mkfs from the node, and it works if you specify /dev/nvme4n1:
oc debug node/e27-h13-r750
mkfs.xfs -f /dev/nvme4n1
This works. Three pods are created and distributed model training runs as expected:
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=device', 'src=/dev/nvme4n1', 'dst="/init", "perm=rw"']
)
job = jobdef.submit()
The type=device approach, mounts=['type=device','src=/dev/nvme4n1','dst="/init", "perm=rw"'], appeared to work but did not actually solve the problem.
There was a RUN mkdir /init in my container image (Dockerfile) that created the directory, and the training data was copied into it by the COPY tiny-imagenet-200 /init/tiny-imagenet-200 command that followed. Hence the test appeared to work, but mount showed that /init was actually mounted on devtmpfs:
root@resnet50-d005br4gsdvjw-2:/horovod/examples# mount | grep init
devtmpfs on /"/init", "perm=rw" type devtmpfs (rw,nosuid,seclabel,size=131726212k,nr_inodes=32931553,mode=755)
The lsblk output below shows that nothing is mounted from /dev/nvme4n1 (the local NVMe drive we specified, where /init should have been mounted):
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
|-sda1 8:1 0 1M 0 part
|-sda2 8:2 0 127M 0 part
|-sda3 8:3 0 384M 0 part
`-sda4 8:4 0 446.6G 0 part /dev/termination-log
sr0 11:0 1 1.1G 0 rom
nvme2n1 259:0 0 1.5T 0 disk
nvme3n1 259:1 0 1.5T 0 disk
nvme4n1 259:2 0 2.9T 0 disk
nvme5n1 259:3 0 1.5T 0 disk
nvme1n1 259:4 0 1.5T 0 disk
nvme0n1 259:5 0 1.5T 0 disk
PVCs specified in this way, mounts=['type=volume','src=dianes-amazing-pvc0','dst=/init', 'type=volume','src=dianes-amazing-pvc1','dst=/init', 'type=volume','src=dianes-amazing-pvc2','dst=/init'], provide a more general solution to the problem than the mounts=['type=device','src=/dev/nvme4n1','dst="/init"'] solution.
If you generate an appwrapper that repeats this section for each of the PVCs (e.g. dianes-amazing-pvc0, dianes-amazing-pvc1, dianes-amazing-pvc2):
volumes:
  - emptyDir:
      medium: Memory
    name: dshm
  - name: mount-0
    persistentVolumeClaim:
      claimName: dianes-amazing-pvc0
volumeMounts:
  - mountPath: /dev/shm
    name: dshm
  - mountPath: /init
    name: mount-0
    readOnly: false
then I think we would have what we need for this multi-node training run with local NVMe drives.
Note: all of the pods need to mount /init from their node's local NVMe drive, because the code is identical on each node (the same copy of ResNet50). The current syntax requires that each mounted filesystem have a different mount path, e.g. /init0, /init1, /init2; see the sketch below.
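For illustration only, a flat-list mounts spec that satisfies the distinct-path constraint might look like the sketch below (the paths /init0, /init1, /init2 are hypothetical). This only illustrates the syntax limitation; it would not solve the per-node problem, since every pod would then request all three PVCs and hit the same volume node affinity conflict.

# Hypothetical sketch of the distinct-path form only; every pod would request all
# three PVCs, so this does not address the node-affinity / per-node data problem.
mounts = [
    'type=volume', 'src=dianes-amazing-pvc0', 'dst=/init0',
    'type=volume', 'src=dianes-amazing-pvc1', 'dst=/init1',
    'type=volume', 'src=dianes-amazing-pvc2', 'dst=/init2',
]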
I manually created appwrapper yaml that allows me to run the same training code on each node, accessing the training data in /init, which is mounted on a local NVMe drive. This solution works but is not ideal, because I have to copy the data into /init once the job is running; I can't pre-stage it to the NVMe drives before each run.