Enable user to provision shared memory for pipeline node
**Is your feature request related to a problem? Please describe.**
Using FARM to manage PyTorch KPI training and extraction, runs would deadlock: https://github.com/os-climate/aicoe-osc-demo/issues/174
I solved this problem by limiting the use of multiprocessing so that shared memory did not need to be allocated. But a better solution would be to allocate sufficient shared memory to allow the runs to complete. The Elyra pipeline editor has an elegant way for users to specify how many CPUs and GPUs, as well as how much RAM, should be allocated to a pipeline node. That pattern could be applied to other resources, like shared memory. Docker run supports the `--shm-size` parameter, for example.
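For comparison, Docker Compose exposes the same setting as a service-level `shm_size` option; a minimal sketch (service name and image are placeholders):

```yaml
# Illustrative only: the Compose equivalent of `docker run --shm-size=1g`.
services:
  trainer:                          # placeholder service name
    image: example/trainer:latest   # placeholder image
    shm_size: "1gb"                 # size of /dev/shm for the container
```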
**Describe the solution you'd like**
I'd like to specify, on a per-node basis, a non-default amount of shared memory to allocate. In my case, I'd like to see whether 512mb is enough, or whether 1gb or 2gb is needed. I should have the freedom to specify an amount (possibly with units).
**Describe alternatives you've considered**
I have already implemented changes to disable multiprocessing, but this makes poor use of the powerful CPUs our cluster makes available. Another possibility would be to write Operate First scripts to control the Kubeflow execution parameters outside of Elyra, but why not expose this parameter that is so critical to specific node tasks?
**Additional context**
My own project repo is here: https://github.com/MichaelTiemannOSC/aicoe-osc-demo/tree/cdp-fixups
@Shreyanand @ptitzler
After quick research it appears that Kubernetes currently does not support setting the pod's shared memory size (https://github.com/kubernetes/kubernetes/issues/28272). Using an emptyDir, as shown in https://docs.openshift.com/online/pro/dev_guide/shared_memory.html, in combination with a size limit might be a possible approach that could be considered.
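As a rough sketch of that approach (pod, container, and image names are placeholders, and the size is arbitrary), a memory-backed emptyDir can be mounted over /dev/shm:

```yaml
# Sketch only: RAM-backed emptyDir mounted over /dev/shm, following the
# pattern in the OpenShift shared-memory doc linked above.
apiVersion: v1
kind: Pod
metadata:
  name: shm-example                 # placeholder name
spec:
  containers:
    - name: trainer                 # placeholder container name
      image: example/trainer:latest # placeholder image
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm       # replaces the default 64 MB /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory              # tmpfs, i.e. backed by RAM
        sizeLimit: 1Gi              # cap on shared memory usage
```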
Thank you! We are investigating and will report back.
If there's any follow-up required on our end please re-open the issue!
Hi @ptitzler, for this issue, I tried the following:
- Exported the pipeline yaml (it's a very useful and powerful feature!)
- Edited the yaml as per this comment that essentially mounts a shm dir and allows more shared memory for multiprocessing
- Imported the yaml into a Kubeflow pipeline run
The pipeline ran successfully without any deadlocks. The increased shm size would help deep learning workloads that use multiprocessing. Could that possibly be added to the node properties in the UI and to the Kubeflow workload yaml in the backend?
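The edit was along these lines (a sketch only; template, volume, and image names are illustrative, and the structure of the exported Kubeflow/Argo yaml will differ from pipeline to pipeline):

```yaml
# Illustrative edit to the exported Kubeflow (Argo) workflow yaml: give one
# pipeline step a RAM-backed /dev/shm larger than the 64 MB default.
spec:
  templates:
    - name: train-step                  # the node that needs shared memory
      container:
        image: example/trainer:latest   # placeholder image
        volumeMounts:
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory              # tmpfs, backed by node RAM
            sizeLimit: 2Gi              # amount of shared memory to allow
```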