codeflare-sdk
Torchx support for local NVME drives not available (with PV/PVCs for local NVME drives) - feature required for expected performance with multi-node training
I'm running multi-node training of ResNet50 with Torchx, codeflare-sdk, MCAD on OCP 4.12.
I have a 3-node OCP 4.12 cluster; each node has one NVIDIA GPU. Each of the 3 worker nodes has one local 2.9TB NVMe drive and an associated PV and PVC.
[root@e23-h21-740xd ResNet]# oc get pv | grep 2980Gi
local-pv-289604ff   2980Gi   RWO   Delete   Bound   default/dianes-amazing-pvc2   local-sc   18h
local-pv-8006a340   2980Gi   RWO   Delete   Bound   default/dianes-amazing-pvc1   local-sc   4d1h
local-pv-86bac87f   2980Gi   RWO   Delete   Bound   default/dianes-amazing-pvc0   local-sc   21h

NAME                  STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dianes-amazing-pvc0   Bound    local-pv-86bac87f   2980Gi     RWO            local-sc       20h
dianes-amazing-pvc1   Bound    local-pv-8006a340   2980Gi     RWO            local-sc       20h
dianes-amazing-pvc2   Bound    local-pv-289604ff   2980Gi     RWO            local-sc       18h
When I run the following Python script ("python3 python-multi-node-pvc.py"):
[ ResNet]# cat python-multi-node-pvc.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=[['type=volume', 'src=dianes-amazing-pvc0', 'dst="/init"'],
            ['type=volume', 'src=dianes-amazing-pvc1', 'dst="/init"'],
            ['type=volume', 'src=dianes-amazing-pvc2', 'dst="/init"']]
)
job = jobdef.submit()
I get the following error: AttributeError: 'list' object has no attribute 'partition'
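A likely cause (my assumption, based on the mounts being handed through to TorchX): TorchX's mount parser expects a flat list of "key=value" strings and calls str.partition("=") on each entry, so passing a nested list of lists makes it call partition on a list rather than a string. A minimal sketch of the difference:

# Minimal sketch, assuming the SDK forwards `mounts` unchanged to TorchX's parser.
from torchx import specs

# Flat list of "key=value" strings: each element supports str.partition("=")
flat = ["type=volume", "src=dianes-amazing-pvc0", "dst=/init"]
print(specs.parse_mounts(flat))

# Nested lists: the parser calls .partition("=") on a list element and fails with
# AttributeError: 'list' object has no attribute 'partition'
nested = [["type=volume", "src=dianes-amazing-pvc0", "dst=/init"]]
# specs.parse_mounts(nested)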
If I specify only one of the local NVME drives as shown below:
[ ResNet]# cat test_resnet3_single_PVC.py
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition
arg_list = [ "--train-dir=/init/tiny-imagenet-200/train", "--val-dir=/init/tiny-imagenet-200/val", "--log-dir=/init/tiny-imagenet-200", "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar" ]
jobdef = DDPJobDefinition( name="resnet50", script="pytorch/pytorch_imagenet_resnet50.py", script_args=arg_list, scheduler_args={"namespace": "default"}, j="3x1", gpu=1, cpu=4, memMB=24000, image="quay.io/dfeddema/horovod", mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"'] ) job = jobdef.submit()
Then one pod starts up successfully and is assigned its local volume. The other two pods are not scheduled because "0/3 nodes are available: 1 Insufficient nvidia.com/gpu, 2 node(s) had volume node affinity conflict. preemption 0/3 are available... etc etc"
So, as expected, the 2nd and 3rd pods could not be scheduled because "2 node(s) had volume node affinity conflict".
I need a way to specify the local NVME drive (and associated PVC) for each of the nodes in my cluster. The training data resides on these local NVME drives.
Thanks @dfeddema!
For direct use of the TorchX CLI, I have tested formatting multiple mounts with the argument --mount type=volume,src=foo-pvc,dst="/foo",type=volume,src=bar-pvc,dst="/bar".
@MichaelClifford can you help us understand the correct syntax for multiple mounts through the CodeFlare SDK DDPJobDefinition?
@Sara-KS I also tested the syntax you show above,
--mount type=volume,src=foo-pvc,dst="/data",type=volume,src=bar-pvc,dst="/data"
and it did not work for me.
Here's what I used, which is similar to your example, and it failed (note: if I remove the quotes in the mount args I get errors):
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=volume', 'src=dianes-amazing-pvc0', 'dst="/init"',
            'type=volume', 'src=dianes-amazing-pvc1', 'dst="/init"',
            'type=volume', 'src=dianes-amazing-pvc2', 'dst="/init"']
)
job = jobdef.submit()
@dfeddema That formatting is for direct use of the TorchX CLI and it should include multiple mounts in the generated yaml. I recommend running it as a dryrun and sharing the output here. @MichaelClifford Is the following the appropriate change needed to switch to a dryrun mode with the CodeFlare SDK?
jobdef = DDPJobDefinition(
to
jobdef = DDPJobDefinition._dry_run(
@Sara-KS yes,
DDPJobDefinition._dry_run(cluster) will generate the dry_run output.
There is a parameter in DDPJobDefinition() that allows you to define mounts.
see https://github.com/project-codeflare/codeflare-sdk/blob/baec8585b2bd918becd030951bf43e3504d43ada/src/codeflare_sdk/job/jobs.py#L62C11-L62C11
And the syntax should be similar to how we handle `script_args`. So something like:
mounts = ['type=volume',
'src=foo-pvc',
'dst="/foo"',
'type=volume',
'src=bar-pvc',
'dst="/bar"']
nvm, sorry @dfeddema, I just fully read your last comment and see that you still got errors with that approach.
Is there some way to get the yaml for this? I want to see how codeflare_sdk is specifying the local volumes when I specify them like this in the jobdef:
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
This syntax, DDPJobDefinition._dry_run(cluster), doesn't help me because I'm not using a Ray cluster. When I try to use dry run as shown below, I get the error "TypeError: _dry_run() got an unexpected keyword argument 'name'":
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition
arg_list = [
"--train-dir=/init/tiny-imagenet-200/train",
"--val-dir=/init/tiny-imagenet-200/val",
"--log-dir=/init/tiny-imagenet-200",
"--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]
jobdef = DDPJobDefinition._dry_run(
name="resnet50",
script="pytorch/pytorch_imagenet_resnet50.py",
script_args=arg_list,
scheduler_args={"namespace": "default"},
j="3x1",
gpu=1,
cpu=4,
memMB=24000,
image="quay.io/dfeddema/horovod",
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)
job = jobdef.submit()
Since you are not using a Ray cluster for this, I think you need to do the following to see the dry_run output:
jobdef = DDPJobDefinition(name="resnet50",
script="pytorch/pytorch_imagenet_resnet50.py",
script_args=arg_list,
scheduler_args={"namespace": "default"},
j="3x1",
gpu=1,
cpu=4,
memMB=24000,
image="quay.io/dfeddema/horovod",
mounts=['type=volume','src=dianes-amazing-pvc0','dst="/init"','type=volume','src=dianes-amazing-pvc1','dst="/init"','type=volume','src=dianes-amazing-pvc2','dst="/init"']
)
jobdef._dry_run_no_cluster()
@MichaelClifford I tried your example above and it didn't produce any output. Maybe I need to import a module that generates this dry run? I see from torchx.specs import AppDryRunInfo, AppDef, but that's not the right module. Is there something similar I'm missing? Is there a DryRun module I need to import?
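One guess, and only an assumption based on the jobs.py source linked above: _dry_run_no_cluster() returns a TorchX AppDryRunInfo object rather than printing anything, so no extra import should be needed; capturing and printing the return value may be enough to see the generated request.

# Assumption: _dry_run_no_cluster() returns a torchx AppDryRunInfo instead of printing it.
dryrun_info = jobdef._dry_run_no_cluster()
print(dryrun_info)  # should display the generated scheduler request, including mounts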
@MichaelClifford @Sara-KS
Thanks Sarah for your suggestion to use type=device. A fully qualified path for the device works.
mounts=['type=device','src=nvme4n1','dst="/init", "perm=rw"']
works with a slight tweak. At first I was getting this error (when I used 'src=nvme4n1'):
Error: failed to mkdir nvme4n1: mkdir nvme4n1: operation not permitted
I tried mkfs from the node, and it works if you specify /dev/nvme4n1:
oc debug node/e27-h13-r750
mkfs.xfs -f /dev/nvme4n1
This works. Three pods are created and distributed model training runs as expected:
# Import pieces from codeflare-sdk
from codeflare_sdk.job.jobs import DDPJobDefinition

arg_list = [
    "--train-dir=/init/tiny-imagenet-200/train",
    "--val-dir=/init/tiny-imagenet-200/val",
    "--log-dir=/init/tiny-imagenet-200",
    "--checkpoint-format=/init/checkpoint-{epoch}.pth.tar"
]

jobdef = DDPJobDefinition(
    name="resnet50",
    script="pytorch/pytorch_imagenet_resnet50.py",
    script_args=arg_list,
    scheduler_args={"namespace": "default"},
    j="3x1",
    gpu=1,
    cpu=4,
    memMB=24000,
    image="quay.io/dfeddema/horovod",
    mounts=['type=device', 'src=/dev/nvme4n1', 'dst="/init", "perm=rw"']
)
job = jobdef.submit()
The type=device approach, mounts=['type=device','src=/dev/nvme4n1','dst="/init", "perm=rw"'], appeared to work but did not actually solve the problem.
There was a RUN mkdir /init in my container image (Dockerfile) that created the directory, and the training data was copied into it by the COPY tiny-imagenet-200 /init/tiny-imagenet-200 command that followed. Hence the test appeared to work, but mount showed that /init was actually mounted on devtmpfs:
root@resnet50-d005br4gsdvjw-2:/horovod/examples# mount | grep init
devtmpfs on /"/init", "perm=rw" type devtmpfs (rw,nosuid,seclabel,size=131726212k,nr_inodes=32931553,mode=755)
The lsblk output below shows that nothing is mounted from /dev/nvme4n1 (the local NVMe drive we specified, where /init should have been mounted):
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
|-sda1 8:1 0 1M 0 part
|-sda2 8:2 0 127M 0 part
|-sda3 8:3 0 384M 0 part
`-sda4 8:4 0 446.6G 0 part /dev/termination-log
sr0 11:0 1 1.1G 0 rom
nvme2n1 259:0 0 1.5T 0 disk
nvme3n1 259:1 0 1.5T 0 disk
nvme4n1 259:2 0 2.9T 0 disk
nvme5n1 259:3 0 1.5T 0 disk
nvme1n1 259:4 0 1.5T 0 disk
nvme0n1 259:5 0 1.5T 0 disk
PVCs specified in this way, mounts=['type=volume','src=dianes-amazing-pvc0','dst=/init', 'type=volume','src=dianes-amazing-pvc1','dst=/init', 'type=volume','src=dianes-amazing-pvc2','dst=/init'], provide a more general solution to the problem than the mounts=['type=device','src=/dev/nvme4n1','dst="/init"'] solution.
If you generate an appwrapper that repeats this section for each of the PVCs (e.g. dianes-amazing-pvc0, dianes-amazing-pvc1, dianes-amazing-pvc2):
volumes:
  - emptyDir:
      medium: Memory
    name: dshm
  - name: mount-0
    persistentVolumeClaim:
      claimName: dianes-amazing-pvc0
volumeMounts:
  - mountPath: /dev/shm
    name: dshm
  - mountPath: /init
    name: mount-0
    readOnly: false
then I think we would have what we need for this multi-node training run with local NVMe drives.
Note: all of the pods need to mount /init from their node's local NVMe drive, because the code is identical on each node (the same copy of ResNet50). The current syntax requires that each mounted filesystem have a different mount path, e.g. /init0, /init1, /init2; see the sketch below.
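For illustration only, a flat-list mounts spec that satisfies the distinct-path constraint might look like the sketch below (the paths /init0, /init1, /init2 are hypothetical). This only illustrates the syntax limitation; it would not solve the per-node problem, since every pod would then request all three PVCs and hit the same volume node affinity conflict.

# Hypothetical sketch of the distinct-path form only; every pod would request all
# three PVCs, so this does not address the node-affinity / per-node data problem.
mounts = [
    'type=volume', 'src=dianes-amazing-pvc0', 'dst=/init0',
    'type=volume', 'src=dianes-amazing-pvc1', 'dst=/init1',
    'type=volume', 'src=dianes-amazing-pvc2', 'dst=/init2',
]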
I manually created appwrapper yaml that allows me to run the same training code on each node, accessing the training data in /init, which is mounted on a local NVMe drive. This solution works but is not ideal, because I have to copy the data into /init once the job is running; I can't pre-stage it to the NVMe drives before each run.