
Docker shared memory issue and solution

Open peteflorence opened this issue 6 years ago • 15 comments

I am not sure if this is happening in our various other configurations, but it was happening in my spartan Docker container inside which I put PyTorch and was trying to do some training.

Symptom

I was getting an error something like, "Bus error (core dumped) model share memory". It's related to this issue: https://github.com/pytorch/pytorch/issues/2244

Cause

The comments by apaszke (a PyTorch author) are helpful here (https://github.com/pytorch/pytorch/issues/1355#issuecomment-308587289): running inside the Docker container, it appears the only available shared memory is 64 MB:

peteflo@08482dc37efa:~$ df -h | grep shm
shm              64M     0   64M   0% /dev/shm

Temp Solution

As mentioned by apaszke,

sudo mount -o remount,size=8G /dev/shm

(choose more than 8G if you'd like)

This fixes it, as visible here:

peteflo@08482dc37efa:~$ df -h | grep shm
shm             8.0G     0  8.0G   0% /dev/shm

Other notes

In some places on the internet you will find that --ipc=host is supposed to avoid this issue, as can other flags to docker run, but those didn't work for me, and they require re-creating the container. I suspect something about my configuration is wrong. The remount above fixes it even from inside a running container.

Long term solution

It would first be useful to identify whether anybody else's Docker containers have this issue, which can be checked simply by running df -h | grep shm inside the container. Then we could diagnose who it is happening to and why. It might just be me.
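
If you'd rather check programmatically than eyeball df, here's a minimal sketch (assuming a Linux container with /dev/shm mounted; the 1 GiB warning threshold is arbitrary):

```python
import os

def shm_size_bytes(path="/dev/shm"):
    """Return the total size of the tmpfs mount at `path`, in bytes."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

if __name__ == "__main__":
    size_gib = shm_size_bytes() / 1024**3
    print(f"/dev/shm is {size_gib:.2f} GiB")
    if size_gib < 1:
        # 64 MB is Docker's default and a red flag for multi-worker DataLoaders
        print("warning: /dev/shm looks tiny; PyTorch workers may hit a bus error")
```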

peteflorence avatar Feb 14 '19 00:02 peteflorence

@manuelli @gizatt @weigao95

has anybody else seen this?

peteflorence avatar Feb 14 '19 00:02 peteflorence

Huh, nope, never seen this -- and I've done a little bit of work with Pytorch in Docker, too...

Would adding that patch command to the docker entrypoint (or build script) help?


gizatt avatar Feb 14 '19 00:02 gizatt

Yes, that would work, but first I would like to ascertain whether anybody else has this issue.

I've done a lot of work with PyTorch in Docker before but haven't had this, so would like to understand what's different.

It's easy to test your own Docker setup; just run:

df -h | grep shm

peteflorence avatar Feb 14 '19 00:02 peteflorence

Yeah, 64m here.

@d354535de71e:~/spartan$ df -h | grep shm
shm              64M     0   64M   0% /dev/shm


gizatt avatar Feb 14 '19 00:02 gizatt

Interesting!

Greg what if you do it in a Docker container you've used with PyTorch?


peteflorence avatar Feb 14 '19 00:02 peteflorence

That will be a little harder to revive on a phone; I'll get back to you!


gizatt avatar Feb 14 '19 01:02 gizatt

why not use: docker run --shm-size 8G

patmarion avatar Feb 14 '19 04:02 patmarion

Yeah, I tried that and for some reason it didn't work for me. I think maybe the docker run string just wasn't formatted correctly. I'll report back if I fix it.


peteflorence avatar Feb 14 '19 04:02 peteflorence

Yeah I have it inside my spartan container as well.

manuelli@paladin-44:~/spartan$ df -h | grep shm
shm              64M     0   64M   0% /dev/shm

but inside pdc container I have 31G.

manuelli@paladin-44:~/code$ df -h | grep shm
tmpfs            32G  882M   31G   3% /dev/shm

So we must have something different between pdc and spartan docker containers that is causing this.

manuelli avatar Feb 14 '19 14:02 manuelli

Thanks for checking. Yeah, I think it won't be hard to switch to sharing all/more memory, like the command @patmarion mentioned.

I am curious to learn whether this has been affecting any robot software in general.


peteflorence avatar Feb 14 '19 14:02 peteflorence

Resolved by either passing --ipc=host or --shm-size 8G.

I did have the arg in the wrong spot in the string that docker_run.py builds up!
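
For reference, the ordering issue is easy to hit: docker run options must come before the image name, or Docker passes them to the container's command instead. A minimal sketch of building the command string (hypothetical helper, not spartan's actual docker_run.py):

```python
def build_docker_run(image, cmd, shm_size=None, ipc_host=False):
    """Assemble a `docker run` invocation; options must precede the image name."""
    args = ["docker", "run"]
    if shm_size:
        args += ["--shm-size", shm_size]  # e.g. "8G"
    if ipc_host:
        args.append("--ipc=host")         # share the host's /dev/shm instead
    args.append(image)                    # everything after this goes to the container
    args += cmd
    return " ".join(args)

# correct: flag before the image name
print(build_docker_run("spartan", ["bash"], shm_size="8G"))
# -> docker run --shm-size 8G spartan bash
```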

peteflorence avatar Feb 15 '19 15:02 peteflorence

Looked at it with @manuelli this morning

We might just want to add --ipc=host by default to spartan

peteflorence avatar Feb 15 '19 15:02 peteflorence

@peteflorence If both --ipc=host and --shm-size work for increasing shared memory, could you help me understand the difference?

austinmw avatar Mar 28 '20 01:03 austinmw
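
Roughly: --shm-size keeps the container in its own IPC namespace but enlarges its private /dev/shm tmpfs, while --ipc=host drops the separate namespace so the container sees the host's IPC objects and /dev/shm. One way to observe which namespace you are in (Linux-only sketch; the namespace ID differs per container unless --ipc=host was used):

```python
import os

def ipc_namespace_id():
    """Identity of the current IPC namespace, e.g. 'ipc:[4026531839]'.

    Run on the host and inside the container: with --ipc=host the two
    values match; with a plain `docker run` they differ.
    """
    return os.readlink("/proc/self/ns/ipc")

if __name__ == "__main__":
    print(ipc_namespace_id())
```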

Both solutions worked for me (though in a separate container that runs PyTorch). Root cause is still unknown? Otherwise perhaps this issue is resolved.

gjstein avatar Aug 19 '20 00:08 gjstein

Is there a way to override the path used by PyTorch multiprocessing (/dev/shm)? Unfortunately, increasing shared memory is not possible for me. I'm looking for something like %env JOBLIB_TEMP_FOLDER=/tmp, which works for sklearn.

depshad avatar Oct 01 '20 15:10 depshad