Denoising Task crashes OOM
Hey!
We are trying to train a BART model for German from scratch on the GC4 corpus. For testing purposes we use only 20GB of the dataset, training in a container with 250GB of RAM and one NVIDIA A100.
Dockerfile
FROM nvidia/cuda:11.3.1-devel-ubuntu20.04
SHELL ["/bin/bash", "-c"]
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Berlin
RUN apt-get update \
&& apt-get install -y software-properties-common \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update && apt-get install -y \
python3.9-dev \
python3.9-venv \
python3.9-distutils \
python3-pip \
git \
llvm \
vim \
neovim \
tree \
curl \
wget \
htop \
zsh \
&& rm -rf /var/lib/apt/lists/*
ENV HOME=/tmp
RUN ln -sf /usr/bin/python3.9 /usr/bin/python3
WORKDIR /workdir/code
COPY requirements.txt .
RUN pip3 install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
requirements.txt
aiohttp==3.8.1; python_version >= "3.7"
aiosignal==1.2.0; python_version >= "3.6"
async-timeout==4.0.2; python_version >= "3.6"
attrs==22.1.0; python_version >= "3.6"
blis==0.7.8; python_version >= "3.6"
catalogue==2.0.8; python_version >= "3.6"
certifi==2022.6.15; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
charset-normalizer==2.1.0; python_full_version >= "3.7.0" and python_version >= "3.7" and python_version < "4"
click==8.1.3; python_version >= "3.7"
colorama==0.4.5; python_full_version >= "3.7.0" and platform_system == "Windows" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" or platform_system == "Windows" and python_version >= "3.7" and python_full_version >= "3.5.0")
cymem==2.0.6; python_version >= "3.6"
datasets==2.4.0
dill==0.3.5.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
docker-pycreds==0.4.0; python_version >= "3.6"
elastic-transport==8.1.2; python_version >= "3.6" and python_version < "4"
elasticsearch==8.3.3; python_version >= "3.6" and python_version < "4"
filelock==3.8.0; python_version >= "3.7" and python_full_version >= "3.7.0"
frozenlist==1.3.1; python_version >= "3.7"
fsspec==2022.7.1; python_version >= "3.7"
gitdb==4.0.9; python_version >= "3.7"
gitpython==3.1.27; python_version >= "3.7"
huggingface-hub==0.8.1; python_full_version >= "3.7.0"
idna==3.3; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
jinja2==3.1.2; python_version >= "3.7"
langcodes==3.3.0; python_version >= "3.6"
markupsafe==2.1.1; python_version >= "3.7"
multidict==6.0.2; python_version >= "3.7"
multiprocess==0.70.13; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
murmurhash==1.0.7; python_version >= "3.6"
numpy==1.23.1
packaging==21.3; python_version >= "3.6" and python_full_version >= "3.7.0"
pandas==1.4.3; python_version >= "3.8"
pathtools==0.1.2; python_version >= "3.6"
pathy==0.6.2; python_version >= "3.6"
preshed==3.0.6; python_version >= "3.6"
promise==2.3; python_version >= "3.6"
protobuf==3.20.1; python_version >= "3.7"
psutil==5.9.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyarrow==9.0.0; python_version >= "3.7"
pydantic==1.9.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pyparsing==3.0.9; python_version >= "3.6" and python_full_version >= "3.7.0"
python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
pytz==2022.2; python_version >= "3.8"
pyyaml==6.0; python_version >= "3.6" and python_full_version >= "3.7.0"
regex==2022.7.25; python_version >= "3.6" and python_full_version >= "3.7.0"
requests==2.28.1; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
responses==0.18.0; python_version >= "3.7"
sentry-sdk==1.9.4; python_version >= "3.6"
setproctitle==1.3.2; python_version >= "3.7"
shortuuid==1.0.9; python_version >= "3.6"
six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
smart-open==5.2.1; python_version >= "3.6" and python_version < "4.0"
smmap==5.0.0; python_version >= "3.7"
spacy-legacy==3.0.9; python_version >= "3.6"
spacy-loggers==1.0.3; python_version >= "3.6"
spacy==3.4.1; python_version >= "3.6"
srsly==2.4.4; python_version >= "3.6"
thinc==8.1.0; python_version >= "3.6"
tokenizers==0.12.1; python_full_version >= "3.7.0"
tqdm==4.64.0; python_full_version >= "3.7.0" and python_version >= "3.6"
transformers==4.21.1; python_full_version >= "3.7.0"
typer==0.4.2; python_version >= "3.6"
typing-extensions==4.3.0; python_version >= "3.7" and python_full_version >= "3.7.0"
urllib3==1.26.11; python_full_version >= "3.7.0" and python_version < "4" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6")
wandb==0.13.1; python_version >= "3.6"
wasabi==0.10.1; python_version >= "3.6"
xxhash==3.0.0; python_version >= "3.6"
yarl==1.8.1; python_version >= "3.7"
ftfy==6.1.1
git+https://github.com/facebookresearch/fairseq.git
tensorboardX==2.5.1
debugpy==1.6.3
Most importantly, we install fairseq from git+https://github.com/facebookresearch/fairseq.git, as we could not get the denoising task to work when installing fairseq from PyPI.
We used commit 176cd934982212a4f75e0669ee81b834ee71dbb0.
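For reproducibility, that commit can also be pinned directly in requirements.txt using pip's VCS syntax (same URL and commit as above), instead of the unpinned line we currently use:

```
git+https://github.com/facebookresearch/fairseq.git@176cd934982212a4f75e0669ee81b834ee71dbb0
```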
We use the following minimal example:
#!/bin/bash
train_dir="/datasets/text/germancolossal4/debug"
out_dir="data/preprocessed"
echo "Preprocessing data: ${train_dir}"
fairseq-preprocess \
--trainpref ${train_dir}/train \
--validpref ${train_dir}/valid \
--testpref ${train_dir}/test \
--task denoising \
--criterion cross_entropy \
--optimizer adam \
--only-source \
--workers 1 \
--destdir ${out_dir}
echo "Finished preprocessing:"
fairseq-train ${out_dir} \
--task denoising \
--arch bart_base \
--batch-size 1 \
--skip-invalid-size-inputs-valid-test \
--optimizer adam
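As a possible data point for triage, we could shard the corpus and preprocess it piecewise to check whether peak RAM scales with input size. A sketch using coreutils split; the demo below substitutes a small generated file for the real ${train_dir}/train:

```shell
# Sketch: shard a line-based corpus so each fairseq-preprocess run sees less data.
# A small generated file stands in for the real "${train_dir}/train" here.
tmp=$(mktemp -d)
seq 1 1000 > "$tmp/train"            # stand-in corpus
# -C: pack as many complete lines as fit into each <=2KB shard; -d: numeric suffixes
split -C 2K -d "$tmp/train" "$tmp/train.shard."
ls "$tmp"/train.shard.*
# Each shard could then be passed to fairseq-preprocess via --trainpref.
```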
We use only one worker for preprocessing because fairseq-preprocess gets stuck with more than one worker.
When running this script with a 20GB train file, fairseq-train runs out of memory and the container crashes without any error message. After adding a wandb project, we observed that training of the first epoch starts but never completes; the pod runs out of memory before then.
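For reference, this is roughly how one could track the training process's memory from outside; a stdlib-only sketch that reads /proc, so Linux-only (the script name, pid argument, and interval are illustrative):

```python
import os
import sys
import time

def rss_mib(pid: int) -> float:
    """Resident set size of `pid` in MiB, read from /proc/<pid>/status (Linux-only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # the field is reported in kB
    return 0.0

def watch(pid: int, interval: float = 5.0) -> None:
    """Print current and peak RSS of `pid` until the process exits."""
    peak = 0.0
    while os.path.exists(f"/proc/{pid}"):
        cur = rss_mib(pid)
        peak = max(peak, cur)
        print(f"pid {pid}: rss={cur:.0f} MiB, peak={peak:.0f} MiB", flush=True)
        time.sleep(interval)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python rss_watch.py <pid-of-fairseq-train>
    watch(int(sys.argv[1]))
```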
The minimal training command we built follows the suggestions in #1899. This issue might be related to #4930.
Is this amount of RAM usage expected?
Thank you very much in advance!