How to set dshm size for training?

Open Andrew-Su-0718 opened this issue 1 year ago • 3 comments

When I submit a pytorchjob with arena, I couldn't find any parameter for the shared memory size, which is very important for PyTorch training.

The size is fixed to 2Gi.

...
    - mountPath: /dev/shm
      name: dshm
...
...
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
...

Does anyone know how to set the dshm size?

Andrew-Su-0718 avatar Feb 29 '24 10:02 Andrew-Su-0718

OK, I found a workaround. In the file /charts/pytorchjob/values.yaml, change:

shmSize: 2Gi

to

shmSize: 64Gi # or any value you want
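
To check whether the new size took effect, something like the following could be run inside the training container. This is only a minimal sketch and not part of arena; the helper name `mount_size_gib` is made up for illustration. On some Kubernetes versions the tmpfs may report a size based on node memory rather than the emptyDir sizeLimit, so treat the number as a hint.

```python
import os
import shutil


def mount_size_gib(path="/dev/shm"):
    """Return the total size, in GiB, of the filesystem mounted at `path`.

    `mount_size_gib` is a hypothetical helper for illustration only.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.total / 1024**3


if __name__ == "__main__":
    # /dev/shm is where the dshm volume is mounted in the pod spec above.
    if os.path.isdir("/dev/shm"):
        print(f"/dev/shm total: {mount_size_gib():.1f} GiB")
    else:
        print("/dev/shm is not present on this system")
```

If the printed total is still around 2 GiB after resubmitting the job, the chart change was probably not picked up.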

Andrew-Su-0718 avatar Mar 01 '24 07:03 Andrew-Su-0718

Same issue

yanshui177 avatar Jun 20 '24 06:06 yanshui177

/assign

Syulin7 avatar Jun 20 '24 08:06 Syulin7