Enable user to provision shared memory for pipeline node
**Is your feature request related to a problem? Please describe.**
Using FARM to manage PyTorch KPI training and extraction, runs would deadlock: https://github.com/os-climate/aicoe-osc-demo/issues/174
I solved this problem by limiting the use of multiprocessing so that shared memory did not need to be allocated. But a better solution would be to allocate sufficient shared memory to allow the runs to complete. The Elyra pipeline editor has an elegant way for users to specify how many CPUs and GPUs, as well as how much RAM, should be allocated to a pipeline node. That pattern could be applied to other resources, like shared memory. Docker run supports the `--shm-size` parameter, for example.
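For comparison, Docker Compose exposes the same setting as a service-level `shm_size` option; a minimal sketch (service name and image are placeholders):

```yaml
# Illustrative only: the Compose equivalent of `docker run --shm-size=1g`.
services:
  trainer:                          # placeholder service name
    image: example/trainer:latest   # placeholder image
    shm_size: "1gb"                 # size of /dev/shm for the container
```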
**Describe the solution you'd like**
I'd like to specify, on a per-node basis, a non-default amount of shared memory to allocate. In my case, I'd like to see whether 512mb is enough, or whether 1gb or 2gb is needed. I should have the freedom to specify an amount (possibly with units).
**Describe alternatives you've considered**
I have already implemented changes to disable multiprocessing, but this makes poor use of the powerful CPUs our cluster makes available. Another possibility would be to write Operate First scripts to control the Kubeflow execution parameters outside of Elyra, but why not expose this parameter that is so critical to specific node tasks?
**Additional context**
My own project repo is here: https://github.com/MichaelTiemannOSC/aicoe-osc-demo/tree/cdp-fixups
@Shreyanand @ptitzler
After quick research it appears that Kubernetes currently does not support setting the pod's shared memory size (https://github.com/kubernetes/kubernetes/issues/28272). Using an emptyDir, as shown in https://docs.openshift.com/online/pro/dev_guide/shared_memory.html, in combination with a size limit might be a possible approach that could be considered.
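As a rough sketch of that approach (pod, container, and image names are placeholders, and the size is arbitrary), a memory-backed emptyDir can be mounted over /dev/shm:

```yaml
# Sketch only: RAM-backed emptyDir mounted over /dev/shm, following the
# pattern in the OpenShift shared-memory doc linked above.
apiVersion: v1
kind: Pod
metadata:
  name: shm-example                 # placeholder name
spec:
  containers:
    - name: trainer                 # placeholder container name
      image: example/trainer:latest # placeholder image
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm       # replaces the default 64 MB /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory              # tmpfs, i.e. backed by RAM
        sizeLimit: 1Gi              # cap on shared memory usage
```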
Thank you! We are investigating and will report back.
If there's any follow-up required on our end please re-open the issue!
Hi @ptitzler, for this issue, I tried the following:
- Exported the pipeline yaml (it's a very useful and powerful feature!)
- Edited the yaml as per this comment that essentially mounts a shm dir and allows more shared memory for multiprocessing
- Imported the yaml into a Kubeflow pipeline run
The pipeline ran successfully without any deadlocks. The increased shm size would help deep learning workloads that use multiprocessing. Could that possibly be added to the node properties in the UI and to the Kubeflow workload yaml in the backend?
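The edit was along these lines (a sketch only; template, volume, and image names are illustrative, and the structure of the exported Kubeflow/Argo yaml will differ from pipeline to pipeline):

```yaml
# Illustrative edit to the exported Kubeflow (Argo) workflow yaml: give one
# pipeline step a RAM-backed /dev/shm larger than the 64 MB default.
spec:
  templates:
    - name: train-step                  # the node that needs shared memory
      container:
        image: example/trainer:latest   # placeholder image
        volumeMounts:
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory              # tmpfs, backed by node RAM
            sizeLimit: 2Gi              # amount of shared memory to allow
```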