
[kosmos-g] Problem about docker image setup

Open caicj15 opened this issue 2 years ago • 15 comments

When installing xformers according to the official instructions, it fails. A low version of torch with a high version of xformers is difficult to install. Can anyone offer a docker image?

caicj15 avatar Oct 17 '23 09:10 caicj15

pip install xformers==0.0.13 is okay. However, there are other problems.

fikry102 avatar Oct 17 '23 10:10 fikry102

pip install xformers==0.0.13 is okay. However, there are other problems.

Yes, there are other problems and it still fails.

caicj15 avatar Oct 17 '23 10:10 caicj15


I can't install nvidia-apex using the following command:

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
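One thing worth checking (a hedged note, not a confirmed fix): the --config-settings flags only work with pip >= 23.1. With an older pip, the form previously suggested in the apex README may work instead:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./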

fikry102 avatar Oct 17 '23 10:10 fikry102

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

xichenpan avatar Oct 18 '23 02:10 xichenpan

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

Hi @apolinario, can you also try this image :D

xichenpan avatar Oct 18 '23 03:10 xichenpan

Hi @caicj15 @fikry102, could you try xichenpan/kosmosg:v1? It was committed from our research environment and works well on Microsoft local nodes and clusters.

I am herbert. I used your image, and "from torchscale.architecture.config import EncoderDecoderConfig" fails.

caicj15 avatar Oct 18 '23 07:10 caicj15

@caicj15 Would you mind trying to run this again:

pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
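
A quick way to confirm the reinstall worked is to re-run the import that was failing (a hedged one-liner):

python -c "from torchscale.architecture.config import EncoderDecoderConfig; print('torchscale ok')"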

xichenpan avatar Oct 18 '23 08:10 xichenpan

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.
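
For reference, a possible way to pin those versions with pip (a sketch that assumes CUDA 11.7 wheels; adjust the index URL to your CUDA version):

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install xformers==0.0.16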

fikry102 avatar Oct 18 '23 08:10 fikry102

@caicj15 Would you mind trying to run this again:

pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/

However, how can I debug app.py with VS Code (or with PyCharm)? It is easy when we use "python train.py xxxx": just add xxxx to "args" in launch.json. But for "python -m yyyy app.py xxxx", how can I debug app.py?

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 \
  app.py None \
  --task kosmosg \
  --criterion kosmosg \
  --arch kosmosg_xl \
  --required-batch-size-multiple 1 \
  --dict-path data/dict.txt \
  --spm-model data/sentencepiece.bpe.model \
  --memory-efficient-fp16 \
  --ddp-backend=no_c10d \
  --distributed-no-spawn \
  --subln \
  --sope-rel-pos \
  --checkpoint-activations \
  --flash-attention \
  --pretrained-ckpt-path ./kosmosg_checkpoints/checkpoint_final.pt

fikry102 avatar Oct 18 '23 08:10 fikry102

Hi @fikry102, good to know! For PyCharm debugging you can refer to https://intellij-support.jetbrains.com/hc/en-us/community/posts/360003879119-how-to-run-python-m-command-in-pycharm-
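
For VS Code, one hedged option is to wrap the module-style launch with debugpy (this assumes you pip install debugpy yourself; it is not part of the repo) and attach the editor to the listening port; launch.json also accepts a "module" key in place of "program". A sketch using the command from above:

pip install debugpy
# listen on port 5678 and pause until an IDE attaches, then run the same module-style command
python -m debugpy --listen 5678 --wait-for-client \
  -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 \
  app.py None \
  --task kosmosg \
  --criterion kosmosg \
  --arch kosmosg_xl \
  --required-batch-size-multiple 1 \
  --dict-path data/dict.txt \
  --spm-model data/sentencepiece.bpe.model \
  --memory-efficient-fp16 \
  --ddp-backend=no_c10d \
  --distributed-no-spawn \
  --subln \
  --sope-rel-pos \
  --checkpoint-activations \
  --flash-attention \
  --pretrained-ckpt-path ./kosmosg_checkpoints/checkpoint_final.pt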

xichenpan avatar Oct 18 '23 08:10 xichenpan

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

Can you give me the command to install the right versions of pytorch and xformers? I tried installing with pip but it still failed. Thanks @fikry102

trangtv57 avatar Oct 26 '23 01:10 trangtv57

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.

@fikry102, I tried pytorch==1.13.1 and xformers==0.0.16 (starting from the author's docker) but still got many errors. I would be very grateful if you could provide the command you used to install the correct packages, including any other changes you had to make to the provided setup script.

blistick avatar Nov 07 '23 19:11 blistick

Did anyone successfully replicate the results? I would love to know the environment used.

Namangarg110 avatar Nov 21 '23 22:11 Namangarg110

Did anyone successfully replicate the results? I would love to know the environment used.

Hi @Namangarg110, could you please try our docker? People said they succeeded using the following script:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/
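
A quick sanity check inside the container before launching anything (hedged one-liners covering the pieces this thread has tripped over):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import xformers; print(xformers.__version__)"
python -c "import fused_layer_norm_cuda; print('apex CUDA extensions ok')"  # the compiled module from the apex build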

xichenpan avatar Nov 21 '23 23:11 xichenpan

Hi @fikry102, thank you for suggesting the above fixes. However, when I pip install those two packages in the given docker image, I still get the following error. Could you suggest a solution?

ImportError: /opt/conda/lib/python3.8/site-packages/fused_layer_norm_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6038) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
demo/gradio_app.py FAILED

I can successfully run this repo by using pytorch==1.13.1 and xformers==0.0.16.
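
For what it is worth, that undefined symbol in fused_layer_norm_cuda usually means apex was compiled against a different PyTorch build than the one currently installed. A hedged sketch of rebuilding it inside the container, reusing the clone steps from earlier in this thread (pick the install flags matching your pip version):

pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
# pip >= 23.1
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./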

PoulamiSM avatar Jul 01 '24 14:07 PoulamiSM