deep-learning-containers

htop => Segmentation fault (core dumped)

Open elgalu opened this issue 1 year ago • 4 comments

How to reproduce

The easiest way to reproduce is with htop, but it also happens with some other packages, such as glances (https://github.com/nicolargo/glances).

# laptop or any computer we tried
docker run --rm -ti \
  763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2
# inside the container or K8s (also tested in a remote Pod)
apt update #=> ok
apt install htop #=> 2.2.0 (also happens with latest htop 3.2.2)
htop
#=>
Segmentation fault (core dumped)

In case it is relevant: the exit code is 139.
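
For context, a shell reports an exit status of 128 plus the signal number when a process is killed by a signal, so 139 corresponds to signal 11 (SIGSEGV). A minimal sketch for decoding such a status, assuming bash on Linux; the function name `signal_from_status` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a shell exit status to a signal name, if any.
signal_from_status() {
  local status="$1"
  if [ "$status" -gt 128 ]; then
    # Statuses above 128 mean "terminated by signal (status - 128)".
    kill -l "$((status - 128))"
  else
    echo "none"
  fi
}

# Decode the status reported in this issue.
signal_from_status 139
```

Inside the container, running `htop; echo $?` right after the crash should show the same status.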

All PyTorch images seem to be affected, regardless of CPU/GPU, training/inference, EC2/SageMaker, etc.

# confirmed segmentation fault with all of:
763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu116-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.13.0-gpu-py39-cu117-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.0.0-cpu-py310-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.13.1-cpu-py39-ubuntu20.04-sagemaker

Works fine with non-PyTorch images, e.g. TensorFlow and MXNet.

# example images where it works fine (confirmed)
763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-training:2.12.0-cpu-py310-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-ec2
763104351884.dkr.ecr.eu-central-1.amazonaws.com/mxnet-training:1.9.0-cpu-py38-ubuntu20.04-ec2
nvidia/cuda:11.7.1-base-ubuntu20.04

elgalu avatar Apr 15 '23 09:04 elgalu

Sorry to bug you again, but it's even easier to reproduce without installing anything. Just run: watch -n 1 whoami #=> Segmentation fault (core dumped)

elgalu avatar Apr 19 '23 11:04 elgalu

Hello @elgalu

Thanks for reporting the issue. Can you try installing htop with conda?

/opt/conda/bin/conda install htop

tejaschumbalkar avatar Apr 27 '23 20:04 tejaschumbalkar

@tejaschumbalkar did you try it?

which htop #=> /opt/conda/bin/htop
conda list htop #=> htop 3.2.2 h8228510_0 conda-forge
htop #=> Segmentation fault (core dumped)

elgalu avatar Apr 28 '23 04:04 elgalu
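
Since both the apt and the conda builds crash, one way to narrow this down is to check which ncurses/tinfo shared libraries the dynamic loader resolves for the affected binaries. The sketch below is only a diagnostic idea, not something run in this thread; `curses_libs` is a hypothetical helper name, and it assumes `ldd` and `grep` are available in the image:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the ncurses/tinfo libraries that the dynamic
# loader would resolve for a given command.
curses_libs() {
  local bin
  bin="$(command -v "$1")" || { echo "not found: $1" >&2; return 1; }
  # ldd prints one "libname => /resolved/path (address)" line per dependency.
  ldd "$bin" | grep -E 'libncurses|libtinfo' \
    || echo "no ncurses/tinfo dependency listed for $bin"
}

# Check the two commands that segfault in this issue.
for cmd in htop watch; do
  curses_libs "$cmd" || true
done

# A library search path override could explain mixed library versions.
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
```

If a path under /opt/conda/lib shows up for a binary installed via apt, that would be consistent with the ncurses conflict described in the last comment of this thread.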

This is now solved in our custom container, though it seems a bit hacky:

    && echo "htop&glances give segmentation fault because of ncurses version conflicts between conda and apt" \
    && rm /usr/lib/x86_64-linux-gnu/libncurses.so.6 \
    && ln -s /opt/conda/lib/libncurses.so.6 /usr/lib/x86_64-linux-gnu/libncurses.so.6 \
    && rm /usr/lib/x86_64-linux-gnu/libncurses.so.6.2 \
    && ln -s /opt/conda/lib/libncurses.so.6.3 /usr/lib/x86_64-linux-gnu/libncurses.so.6.2 \
    && rm /usr/lib/x86_64-linux-gnu/libncursesw.so.6.2 \
    && ln -s /opt/conda/lib/libncursesw.so.6.3 /usr/lib/x86_64-linux-gnu/libncursesw.so.6.2 \
    && rm /usr/lib/x86_64-linux-gnu/libncursesw.so.6 \
    && ln -s /opt/conda/lib/libncursesw.so.6 /usr/lib/x86_64-linux-gnu/libncursesw.so.6 \
    && rm /usr/lib/i386-linux-gnu/libncurses.so.6 \
    && ln -s /opt/conda/lib/libncurses.so.6 /usr/lib/i386-linux-gnu/libncurses.so.6 \

elgalu avatar May 03 '23 10:05 elgalu
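
The x86_64 part of the workaround above could also be written as a small loop, which makes it easier to preview before touching system libraries. This is a sketch under assumptions: `relink_ncurses` and the `DRY_RUN` variable are hypothetical names, the file pairs are copied from the commands above, the i386 line is left out, and `ln -sfn` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: same symlinks as the x86_64 lines of the workaround,
# driven by a table. DRY_RUN=1 prints the commands instead of running them.
relink_ncurses() {
  local src_dir="$1" dst_dir="$2" pair src dst
  # "system name:conda name" pairs taken from the workaround in the thread.
  for pair in \
    "libncurses.so.6:libncurses.so.6" \
    "libncurses.so.6.2:libncurses.so.6.3" \
    "libncursesw.so.6:libncursesw.so.6" \
    "libncursesw.so.6.2:libncursesw.so.6.3"; do
    dst="$dst_dir/${pair%%:*}"
    src="$src_dir/${pair##*:}"
    if [ ! -e "$src" ]; then
      # Never link to a missing target; report and move on.
      echo "skip: $src does not exist" >&2
      continue
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ln -sfn $src $dst"
    else
      ln -sfn "$src" "$dst"   # -f replaces the existing file or link
    fi
  done
}

# Preview against the paths used in the thread without changing anything.
DRY_RUN=1 relink_ncurses /opt/conda/lib /usr/lib/x86_64-linux-gnu
```

Skipping missing sources avoids leaving dangling links if the conda ncurses version differs from 6.3 in another image.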