llm-foundry

Broken on docker image?

Open tginart opened this issue 1 year ago • 6 comments

I am trying to follow the Quickstart guide on the mosaicml/pytorch Docker image and am running into issues when running the exact commands.

The training step is broken. In particular, there seems to be an issue setting up the StreamingDataset required for training.

For example, the command in the training README:

python ../../llmfoundry/data/text_data.py --local_path ./my-copy-c4 --split val_small

fails with a bus error:

Bus error (core dumped)
root@e81df48d8ecb:/home/llm-foundry# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Is this just me or can anyone repro?

tginart avatar May 09 '23 20:05 tginart

If you are in /home/llm-foundry/scripts/train/, you should run:

python ../../llmfoundry/data/text_data.py --local_path ./my-copy-c4 --split val_small

If you are in /home/llm-foundry, you should run:

python llmfoundry/data/text_data.py --local_path ./my-copy-c4 --split val_small

You can change local_path accordingly (based on where you want the dataset to be saved).

vchiley avatar May 09 '23 21:05 vchiley

Did this resolve your issue @tginart?

codestar12 avatar May 10 '23 17:05 codestar12

Hi! No, this did not resolve it. I suspect there is some kind of issue with StreamingDataset in the current Docker environment.

For example, running this script:

import numpy as np  # required for the np.random calls below
from PIL import Image
from shutil import rmtree
from uuid import uuid4
from streaming import MDSWriter

# Local or remote directory path to store the output compressed files.
# For remote directory, the output files are automatically uploaded to a remote cloud storage
# location.
out_root = 'dirname'

# A dictionary of input fields to an Encoder/Decoder type
columns = {
    'uuid': 'str',
    'img': 'jpeg',
    'clf': 'int'
}

# Compression algorithm name
compression = 'zstd'

# Hash algorithm name
hashes = ['sha1', 'xxh64']

# Generates random images and classes for input sample
samples = [
    {
        'uuid': str(uuid4()),
        'img': Image.fromarray(np.random.randint(0, 256, (32, 48, 3), np.uint8)),
        'clf': np.random.randint(10),
    }
    for _ in range(1000)
]

# Call `MDSWriter` to iterate through the input data and write into a shard `mds` file
with MDSWriter(out=out_root, columns=columns, compression=compression, hashes=hashes) as out:
    for sample in samples:
        out.write(sample)

# Reading the dataset back
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Remote directory (S3 or local filesystem) where dataset is stored
remote_dir = 'dirname'
# Local directory where dataset is cached during operation
local_dir = 'dirname'
dataset = StreamingDataset(local=local_dir, remote=remote_dir, split=None, shuffle=True)

# Create PyTorch DataLoader
dataloader = DataLoader(dataset)
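
# Hypothetical next step (illustrative only, not part of the original snippet):
# read one sample back to confirm decoding works. In this environment the bus
# error below is hit before this point is ever reached.
sample = dataset[0]
print(sample['uuid'], sample['clf'], sample['img'].size)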

results in a similar error:

Bus error (core dumped)
root@e81df48d8ecb:/home/llm-foundry/scripts/train# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

All of this is inside the Docker image. Perhaps I should be using mosaicml/composer instead? Is there any chance the llm-foundry installation can go awry in the mosaicml/pytorch container?

tginart avatar May 10 '23 20:05 tginart

FYI that script is pulled from the Streaming docs:

https://docs.mosaicml.com/projects/streaming/en/stable/getting_started/quick_start.html

tginart avatar May 10 '23 20:05 tginart

I'll give it a look. Thanks for bringing that to our attention.

codestar12 avatar May 10 '23 23:05 codestar12

I'm also able to reproduce this error and basically had to set up an environment outside of Docker to run it.

Paladiamors avatar May 16 '23 04:05 Paladiamors

@Paladiamors Any luck with getting Triton's flash attention set up? I've tried 3 different machines/GPU types and close to a half dozen different envs/images and can't get that package to work!

tginart avatar May 17 '23 18:05 tginart

Hi @tginart and @Paladiamors, it would be helpful to share some more information about your machine and OS specs. Could you try running https://github.com/mosaicml/composer/blob/dev/composer/utils/collect_env.py, and also share the Docker image you are trying to run? I'll tag our streaming folks to take a look.
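
For reference, one way to produce that report from inside the container is a short Python call; this is a minimal sketch, assuming composer is installed and its collect_env helper is importable:

# Sketch: print Composer's system environment report (assumes composer is installed)
from composer.utils.collect_env import print_env

print_env()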

hanlint avatar May 18 '23 14:05 hanlint

@hanlint Basically I'm getting the same problem as the poster here when running the Docker image:

Bus error (core dumped)
root@e81df48d8ecb:/home/llm-foundry# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The error message continues with what looks like a memory-leak warning. I'm running on a g5.12xlarge machine; the g4dn series machines do not support the bfloat16 data type, so the newer machines are needed.
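
As a side note, bfloat16 support can be confirmed directly from PyTorch; a small sketch, assuming torch is importable in the environment:

import torch

# True on the A10G (Ampere) GPUs in g5 instances; False on the T4s in g4dn instances.
print(torch.cuda.is_bf16_supported())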

I've run the script, and the details are provided below:

Collecting system information...
---------------------------------
System Environment Report        
Created: 2023-05-19 02:38:42 UTC
---------------------------------

PyTorch information
-------------------
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.10.10 (main, May 13 2023, 14:12:46) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1035-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] Could not collect


Composer information
--------------------
Composer version: 0.14.1
Composer commit hash: None
Host processor model name: AMD EPYC 7R32
Host processor core count: 32
Number of nodes: 1
Accelerator model name: NVIDIA A10G
Accelerators per node: 1
CUDA Device Count: 1

Paladiamors avatar May 19 '23 02:05 Paladiamors

Interesting! My issue was also on the A10. I'm using the g5.12xlarge instance type with the Deep Learning AMI GPU PyTorch 1.13.1 (Ubuntu 20.04) AMI.

I tried both the mosaicml/pytorch and mosaicml/composer Docker images and neither worked.

tginart avatar May 19 '23 06:05 tginart

Thank you both for the information! cc: @karan6181 and @knighton

hanlint avatar May 19 '23 17:05 hanlint

Hi @tginart and @Paladiamors, I ran the python ../../llmfoundry/data/text_data.py --local_path ./my-copy-c4 --split val_small script on my cluster (GPU and CPU) with mosaicml/pytorch and mosaicml/composer, and I was able to run it without any issues. Below is the output:

$ python llmfoundry/data/text_data.py --local_path ./my-copy-c4/ --split val_small
Reading val_small split from ./my-copy-c4/
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 1.99MB/s]
Downloading (…)olve/main/vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.08M/1.08M [00:00<00:00, 2.50MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 457k/457k [00:00<00:00, 21.0MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 13.8MB/s]
Downloading (…)cial_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 603kB/s]


#################### Batch 0 ####################
input_ids torch.Size([2, 32]) torch.int64
attention_mask torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
The woman who died after falling from a bridge over the A21 has been identified as a Sevenoaks mum.
Marta Kendle, 37, fell
--------------------  Sample 1  --------------------
^ Source: Wilson (1999, 2). The many changes of European political borders since Mozart's time make it difficult to assign him a unambiguous nationality; for


#################### Batch 1 ####################
input_ids torch.Size([2, 32]) torch.int64
attention_mask torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
You too can live in Bramblebury!
At the moment there are no houses for sale in Bramblebury.
To apply for one of the other
--------------------  Sample 1  --------------------
Start a continuously running clock.
and increases by 1 each minute.
back up to start round 16.
try to slow the descent at least a little


#################### Batch 2 ####################
input_ids torch.Size([2, 32]) torch.int64
attention_mask torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
How can I stop sending feedback reminders?
Can I automatically leave feedback to my buyers?
Can I get alerted when I get negative feedback?
Can
--------------------  Sample 1  --------------------
At Hargray, we use the latest technology to manage our diverse platforms and support operational success. We foster an environment where innovative solutions are the key to driving our


#################### Batch 3 ####################
input_ids torch.Size([2, 32]) torch.int64
attention_mask torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
Quick & Dirty Way to Bypass Themida Anti-Attach!
totally unlawful and you could be hold accountable for your actions in a court
--------------------  Sample 1  --------------------
Anderson, C. Leigh; Biscaye, Pierre E.; Reynolds, Travis W.
"title": "Data for Policy 2017: Government by Algorithm?"


#################### Batch 4 ####################
input_ids torch.Size([2, 32]) torch.int64
attention_mask torch.Size([2, 32]) torch.int64
labels torch.Size([2, 32]) torch.int64
--------------------  Sample 0  --------------------
Obtain the dimension of surface tension & state its S?
It is an agreement describing the various terms of the Partnership between the partners. This is a
--------------------  Sample 1  --------------------
UK flat roofers expanding businesses into EPDM with industry-leading support and resources!
Permaroof UK Ltd is delighted to announce that the latest figures

I am wondering if something is wrong with the Deep Learning AMI or the /tmp directory. Can you check that your /tmp directory is not full? One reason you can hit a core dump is when the /tmp directory fills up or, more generally, when you run out of memory. Another place to look is the shm-size of your Docker container; make sure you have enough shared memory allocated inside the container.
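
A quick way to check both from inside the container (a small sketch using only the Python standard library):

import shutil

# Report free space for /tmp and the shared-memory mount that DataLoader workers use.
for path in ('/tmp', '/dev/shm'):
    total, used, free = shutil.disk_usage(path)
    print(f'{path}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB')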

Also, regarding the warning message There appear to be 2 leaked shared_memory objects to clean up at shutdown: you are seeing this because of the core-dump crash, since the process died unexpectedly and its resources did not get cleaned up.

karan6181 avatar May 22 '23 15:05 karan6181

Closing this issue for now as it seems to have gone stale. Personally, I think this looks like a shared-memory issue: if you are using AWS instances and Docker, you have to be careful to allocate enough shared memory (e.g. by passing --shm-size to docker run). We have run into this problem ourselves before, and increasing shm-size fixed it.

abhi-mosaic avatar May 31 '23 00:05 abhi-mosaic


Thank you!

matveybuk avatar Jan 25 '24 07:01 matveybuk