
Composer container image doesn't show exceptions in logs when Trainer crashes

Open jwatte opened this issue 1 year ago • 3 comments

**Environment**

Collecting system information...

System Environment Report
Created: 2023-05-24 19:24:21 UTC

PyTorch information

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-1025-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] Could not collect

Composer information

Composer version: 0.14.1
Composer commit hash: None
Host processor model name: AMD EPYC 7R32
Host processor core count: 48
Number of nodes: 1
Accelerator model name: NVIDIA A10G
Accelerators per node: 1
CUDA Device Count: 4

**To reproduce**

Steps to reproduce the behavior:

  1. Make a dataset for fine-tuning mpt-7b-instruct.
  2. Build a trainer Docker image based on mosaicml/composer:latest, adding the necessary prerequisites.
  3. docker run --runtime=nvidia -it --rm composer composer train.py mymodel.yaml
  4. Note that the datasets load and it prints "Building trainer...", and then it exits with the error shown in the log below.

**Expected behavior**

Fine-tuning should start, or at least, when it fails, a reasonable exception / stack trace / error message should be printed.

**Additional context**

Dockerfile:

FROM mosaicml/composer
RUN apt-get update && apt-get install -y vim git
RUN pip3.10 install --upgrade pip
RUN pip3.10 install composer omegaconf
WORKDIR /mpt/
RUN git clone https://github.com/mosaicml/llm-foundry
WORKDIR /mpt/llm-foundry
RUN pip3.10 install -e ".[gpu]"
WORKDIR /mpt/llm-foundry/scripts/train

start script:

docker run --runtime=nvidia --rm -it --name train -v /opt/mpt-7b:/mpt -v /opt/mpt-7b/cache:/root/.cache train composer train.py observe-help.yaml

log:

Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef4f1a35606f162a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Building trainer...
ERROR:composer.cli.launcher:Rank 3 crashed with exit code -7.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

stdout:

Global rank 3 (PID 104) exited with code -7
----------Begin global rank 3 STDOUT----------
Initializing model...
cfg.n_params=6.65e+09
Building train loader...
Importing preprocessing function via: `from preprocess_observe_help import o11y_help`
Building eval loader...
Importing preprocessing function via: `from preprocess_observe_help import o11y_help`
Building trainer...

----------End global rank 3 STDOUT----------

stderr:

----------Begin global rank 3 STDERR----------
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/a858cfabdc6bf69c03ce63236a5e877517bb957c/attention.py:153: UserWarning: While `attn_impl: triton` can be faster than `attn_impl: flash` it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.
  warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:06<00:06,  6.01s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  3.70s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.05s/it]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef4f1a35606f162a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)

Map:   0%|          | 0/2610 [00:00<?, ? examples/s]
Map:   4%|▍         | 111/2610 [00:00<00:02, 1088.71 examples/s]
Map:   9%|▊         | 224/2610 [00:00<00:02, 1109.61 examples/s]
Map:  14%|█▍        | 377/2610 [00:00<00:02, 1053.76 examples/s]
Map:  20%|██        | 527/2610 [00:00<00:02, 1023.82 examples/s]
Map:  24%|██▍       | 635/2610 [00:00<00:01, 1032.76 examples/s]
Map:  29%|██▊       | 745/2610 [00:00<00:01, 1048.90 examples/s]
Map:  33%|███▎      | 862/2610 [00:00<00:01, 1083.13 examples/s]
Map:  38%|███▊      | 1000/2610 [00:01<00:01, 919.49 examples/s]
Map:  42%|████▏     | 1103/2610 [00:01<00:01, 942.95 examples/s]
Map:  47%|████▋     | 1214/2610 [00:01<00:01, 985.67 examples/s]
Map:  52%|█████▏    | 1362/2610 [00:01<00:01, 984.68 examples/s]
Map:  56%|█████▌    | 1464/2610 [00:01<00:01, 988.78 examples/s]
Map:  60%|██████    | 1571/2610 [00:01<00:01, 1007.56 examples/s]
Map:  64%|██████▍   | 1676/2610 [00:01<00:00, 1014.13 examples/s]
Map:  68%|██████▊   | 1781/2610 [00:01<00:00, 1021.52 examples/s]
Map:  72%|███████▏  | 1886/2610 [00:01<00:00, 1027.09 examples/s]
Map:  76%|███████▌  | 1990/2610 [00:01<00:00, 1026.62 examples/s]
Map:  80%|████████  | 2096/2610 [00:02<00:00, 866.59 examples/s]
Map:  85%|████████▌ | 2229/2610 [00:02<00:00, 869.06 examples/s]
Map:  91%|█████████ | 2376/2610 [00:02<00:00, 901.29 examples/s]
Map:  95%|█████████▍| 2475/2610 [00:02<00:00, 920.22 examples/s]
Map:  98%|█████████▊| 2570/2610 [00:02<00:00, 925.10 examples/s]

Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef4f1a35606f162a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)

Map:   0%|          | 0/315 [00:00<?, ? examples/s]
Map:  35%|███▍      | 110/315 [00:00<00:00, 1084.28 examples/s]
Map:  85%|████████▍ | 267/315 [00:00<00:00, 1055.64 examples/s]


----------End global rank 3 STDERR----------

What is exit code -7? Why did the process die inside Trainer()? Who knows!

(The stdout/stderr printouts contain nothing better)
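
For reference, a negative code from the launcher is typically the negated number of the signal that killed the rank, i.e. the child was terminated by the kernel rather than raising a Python exception, so there may have been no traceback to print at all. A minimal sketch for decoding it, assuming Linux x86-64 signal numbering:

import signal

exit_code = -7                            # as reported by the composer launcher
print(signal.Signals(-exit_code).name)    # prints SIGBUS: the rank was killed by a signal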

jwatte · May 24 '23 19:05

The bug here is "when the trainer crashes, no logs are printed." If I don't use the container image but instead install all dependencies locally and start composer, it does print a helpful error message that it's running out of RAM on the 20 GB available on the A10G. I will edit the topic to clarify.

So, when the spawned processes crash for some reason while running in Docker, the composer subprocess launcher seems unable to harvest the exception from the subprocesses' standard error. Maybe the process dies too quickly, before the output gets flushed.
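
To illustrate that hypothesis, here is a minimal, self-contained sketch (a hypothetical stand-in launcher, not Composer's actual one): when a child whose stdout goes to a pipe is killed by a signal, whatever is still sitting in its block buffer never reaches the parent.

import os, signal, subprocess, sys, time

# Child that prints a progress line and then hangs, standing in for a rank that
# the kernel kills mid-setup. Because its stdout is a pipe, print() output is
# block-buffered rather than line-buffered.
child_src = "import time; print('Building trainer...'); time.sleep(30)"

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdout=subprocess.PIPE, text=True)
time.sleep(1)
os.kill(proc.pid, signal.SIGBUS)     # simulate the kernel taking the rank down
out, _ = proc.communicate()
print(repr(out), proc.returncode)    # typically prints '' -7: the buffered line was lost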

jwatte · May 25 '23 16:05

Hi @jwatte thanks for reporting this! Just to clarify, you're expecting a CUDA OOM error to be printed to the screen, correct?

We don't do anything in the Docker image that would inhibit printing of exceptions. It does look like you use the composer launcher which waits 30s for child processes to clean up, which I would think is long enough for any buffered prints to flush.

As a quick experiment, could you try setting the following env var and then reproducing? export PYTHONUNBUFFERED=True

For more information: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUNBUFFERED
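
Continuing the hypothetical sketch from the comment above (again a stand-in, not Composer's launcher), running the same child with the suggested variable does get the buffered line through, even though the process is still killed before any exception is raised:

import os, signal, subprocess, sys, time

child_src = "import time; print('Building trainer...'); time.sleep(30)"
env = dict(os.environ, PYTHONUNBUFFERED="1")    # the variable suggested above

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdout=subprocess.PIPE, text=True, env=env)
time.sleep(1)
os.kill(proc.pid, signal.SIGBUS)
out, _ = proc.communicate()
print(repr(out))    # 'Building trainer...\n' is captured now; a traceback still is not,
                    # because the process died before any exception could be raised

One caveat when testing this inside the container: an export in the host shell does not propagate into the container, so the variable needs to be passed with docker run -e PYTHONUNBUFFERED=1 or set with an ENV line in the Dockerfile.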

bandish-shah · May 30 '23 20:05

I have the same issue.

The output above looks exactly like mine, but I don't get an OOM as bandish-shah suggests. I set export PYTHONUNBUFFERED=True and don't receive any new information for this error.

I start the training process using this image (mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04) and the provided setup.py, and start it with composer: composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml

Einengutenmorgen · Jun 06 '23 08:06

Closing as this is stale; I spent some time on it but was unable to reproduce :(

If this is still an issue, please feel free to reopen! I am happy to investigate further.

mvpatel2000 · Mar 13 '24 18:03