
[kosmos-g] Problem about docker image setup

Open caicj15 opened this issue 2 years ago • 15 comments

When installing xformers according to the official instructions, it fails. A low version of torch with a high version of xformers is difficult to install. Can anyone offer a docker image?

caicj15 avatar Oct 17 '23 09:10 caicj15

pip install xformers==0.0.13 is okay. However, there are other problems.

fikry102 avatar Oct 17 '23 10:10 fikry102

pip install xformers==0.0.13 is okay. However, there are other problems.

Yes, there are other problems and it still fails.

caicj15 avatar Oct 17 '23 10:10 caicj15


I can't install nvidia-apex using the following command:

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
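One thing worth checking (a hedged note, not a confirmed fix): the --config-settings flags only work with pip >= 23.1. With an older pip, the form previously suggested in the apex README may work instead:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./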

fikry102 avatar Oct 17 '23 10:10 fikry102

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

xichenpan avatar Oct 18 '23 02:10 xichenpan

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

Hi @apolinario, can you also try this image :D

xichenpan avatar Oct 18 '23 03:10 xichenpan

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

I am herbert. I used your image, and "from torchscale.architecture.config import EncoderDecoderConfig" fails.

caicj15 avatar Oct 18 '23 07:10 caicj15

@caicj15 Would you mind trying to run this again:

pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
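
A quick way to confirm the reinstall worked is to re-run the import that was failing (a hedged one-liner):

python -c "from torchscale.architecture.config import EncoderDecoderConfig; print('torchscale ok')"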

xichenpan avatar Oct 18 '23 08:10 xichenpan

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.
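
For reference, a possible way to pin those versions with pip (a sketch that assumes CUDA 11.7 wheels; adjust the index URL to your CUDA version):

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install xformers==0.0.16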

fikry102 avatar Oct 18 '23 08:10 fikry102

@caicj15 Would you mind trying to run this again:

pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/

However, how can I debug app.py with VS Code (or with PyCharm)? It is easy when we use "python train.py xxxx": just add xxxx to "args" in launch.json. But for "python -m yyyy app.py xxxx", how can I debug app.py?

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 \
  app.py None \
  --task kosmosg \
  --criterion kosmosg \
  --arch kosmosg_xl \
  --required-batch-size-multiple 1 \
  --dict-path data/dict.txt \
  --spm-model data/sentencepiece.bpe.model \
  --memory-efficient-fp16 \
  --ddp-backend=no_c10d \
  --distributed-no-spawn \
  --subln \
  --sope-rel-pos \
  --checkpoint-activations \
  --flash-attention \
  --pretrained-ckpt-path ./kosmosg_checkpoints/checkpoint_final.pt

fikry102 avatar Oct 18 '23 08:10 fikry102

Hi @fikry102, good to know! For PyCharm debugging you can refer to https://intellij-support.jetbrains.com/hc/en-us/community/posts/360003879119-how-to-run-python-m-command-in-pycharm-
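
For VS Code, one hedged option is to wrap the module-style launch with debugpy (this assumes you pip install debugpy yourself; it is not part of the repo) and attach the editor to the listening port; launch.json also accepts a "module" key in place of "program". A sketch using the command from above:

pip install debugpy
# listen on port 5678 and pause until an IDE attaches, then run the same module-style command
python -m debugpy --listen 5678 --wait-for-client \
  -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 \
  app.py None \
  --task kosmosg \
  --criterion kosmosg \
  --arch kosmosg_xl \
  --required-batch-size-multiple 1 \
  --dict-path data/dict.txt \
  --spm-model data/sentencepiece.bpe.model \
  --memory-efficient-fp16 \
  --ddp-backend=no_c10d \
  --distributed-no-spawn \
  --subln \
  --sope-rel-pos \
  --checkpoint-activations \
  --flash-attention \
  --pretrained-ckpt-path ./kosmosg_checkpoints/checkpoint_final.pt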

xichenpan avatar Oct 18 '23 08:10 xichenpan

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

Can you give me the command to install the right versions of pytorch and xformers? I tried installing with pip but it still failed. Thanks @fikry102

trangtv57 avatar Oct 26 '23 01:10 trangtv57

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

@fikry102, I tried pytorch==1.13.1 and xformers==0.0.16 (starting from the author's docker) but still got many errors. I would be very grateful if you could provide the command you used to install the correct packages, including any other changes you had to make to the provided setup script.

blistick avatar Nov 07 '23 19:11 blistick

Did anyone successfully replicate the results? I would love to know the environment used.

Namangarg110 avatar Nov 21 '23 22:11 Namangarg110

Did anyone successfully replicate the results? I would love to know the environment used.

Hi @Namangarg110, could you please try our docker? People said they succeeded using the following script:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
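
A quick sanity check inside the container before launching anything (hedged one-liners covering the pieces this thread has tripped over):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import xformers; print(xformers.__version__)"
python -c "import fused_layer_norm_cuda; print('apex CUDA extensions ok')"  # the compiled module from the apex build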

xichenpan avatar Nov 21 '23 23:11 xichenpan

Hi @fikry102, thank you for suggesting the above fixes. However, when I pip install those two packages in the given docker image, I still get the following error. Could you suggest a solution?

ImportError: /opt/conda/lib/python3.8/site-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6038) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.
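
For what it is worth, that undefined symbol in fused_layer_norm_cuda usually means apex was compiled against a different PyTorch build than the one currently installed. A hedged sketch of rebuilding it inside the container, reusing the clone steps from earlier in this thread (pick the install flags matching your pip version):

pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
# pip >= 23.1
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./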

PoulamiSM avatar Jul 01 '24 14:07 PoulamiSM