Denoising Task crashes OOM
Hey!
We are trying to train a BART model for German from scratch on the GC4 corpus. For testing purposes we use only 20GB of the dataset, training in a container with 250GB of RAM and one NVIDIA A100.
Dockerfile
FROM nvidia/cuda:11.3.1-devel-ubuntu20.04
SHELL ["/bin/bash", "-c"]
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Berlin
RUN apt-get update \
&& apt-get install -y software-properties-common \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update && apt-get install -y \
python3.9-dev \
python3.9-venv \
python3.9-distutils \
python3-pip \
git \
llvm \
vim \
neovim \
tree \
curl \
wget \
htop \
zsh \
&& rm -rf /var/lib/apt/lists/*
ENV HOME=/tmp
RUN ln -sf /usr/bin/python3.9 /usr/bin/python3
WORKDIR /workdir/code
COPY requirements.txt .
RUN pip3 install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
requirements.txt
aiohttp==3.8.1; python_version >= "3.7"
aiosignal==1.2.0; python_version >= "3.6"
async-timeout==4.0.2; python_version >= "3.6"
attrs==22.1.0; python_version >= "3.6"
blis==0.7.8; python_version >= "3.6"
catalogue==2.0.8; python_version >= "3.6"
certifi==2022.6.15; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
charset-normalizer==2.1.0; python_full_version >= "3.7.0" and python_version >= "3.7" and python_version < "4"
click==8.1.3; python_version >= "3.7"
colorama==0.4.5; python_full_version >= "3.7.0" and platform_system == "Windows" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" or platform_system == "Windows" and python_version >= "3.7" and python_full_version >= "3.5.0")
cymem==2.0.6; python_version >= "3.6"
datasets==2.4.0
dill==0.3.5.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
docker-pycreds==0.4.0; python_version >= "3.6"
elastic-transport==8.1.2; python_version >= "3.6" and python_version < "4"
elasticsearch==8.3.3; python_version >= "3.6" and python_version < "4"
filelock==3.8.0; python_version >= "3.7" and python_full_version >= "3.7.0"
frozenlist==1.3.1; python_version >= "3.7"
fsspec==2022.7.1; python_version >= "3.7"
gitdb==4.0.9; python_version >= "3.7"
gitpython==3.1.27; python_version >= "3.7"
huggingface-hub==0.8.1; python_full_version >= "3.7.0"
idna==3.3; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
jinja2==3.1.2; python_version >= "3.7"
langcodes==3.3.0; python_version >= "3.6"
markupsafe==2.1.1; python_version >= "3.7"
multidict==6.0.2; python_version >= "3.7"
multiprocess==0.70.13; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
murmurhash==1.0.7; python_version >= "3.6"
numpy==1.23.1
packaging==21.3; python_version >= "3.6" and python_full_version >= "3.7.0"
pandas==1.4.3; python_version >= "3.8"
pathtools==0.1.2; python_version >= "3.6"
pathy==0.6.2; python_version >= "3.6"
preshed==3.0.6; python_version >= "3.6"
promise==2.3; python_version >= "3.6"
protobuf==3.20.1; python_version >= "3.7"
psutil==5.9.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyarrow==9.0.0; python_version >= "3.7"
pydantic==1.9.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pyparsing==3.0.9; python_version >= "3.6" and python_full_version >= "3.7.0"
python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
pytz==2022.2; python_version >= "3.8"
pyyaml==6.0; python_version >= "3.6" and python_full_version >= "3.7.0"
regex==2022.7.25; python_version >= "3.6" and python_full_version >= "3.7.0"
requests==2.28.1; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
responses==0.18.0; python_version >= "3.7"
sentry-sdk==1.9.4; python_version >= "3.6"
setproctitle==1.3.2; python_version >= "3.7"
shortuuid==1.0.9; python_version >= "3.6"
six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
smart-open==5.2.1; python_version >= "3.6" and python_version < "4.0"
smmap==5.0.0; python_version >= "3.7"
spacy-legacy==3.0.9; python_version >= "3.6"
spacy-loggers==1.0.3; python_version >= "3.6"
spacy==3.4.1; python_version >= "3.6"
srsly==2.4.4; python_version >= "3.6"
thinc==8.1.0; python_version >= "3.6"
tokenizers==0.12.1; python_full_version >= "3.7.0"
tqdm==4.64.0; python_full_version >= "3.7.0" and python_version >= "3.6"
transformers==4.21.1; python_full_version >= "3.7.0"
typer==0.4.2; python_version >= "3.6"
typing-extensions==4.3.0; python_version >= "3.7" and python_full_version >= "3.7.0"
urllib3==1.26.11; python_full_version >= "3.7.0" and python_version < "4" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6")
wandb==0.13.1; python_version >= "3.6"
wasabi==0.10.1; python_version >= "3.6"
xxhash==3.0.0; python_version >= "3.6"
yarl==1.8.1; python_version >= "3.7"
ftfy==6.1.1
git+https://github.com/facebookresearch/fairseq.git
tensorboardX==2.5.1
debugpy==1.6.3
Most importantly, we install fairseq from git+https://github.com/facebookresearch/fairseq.git, as we could not get the denoising task to work when installing fairseq from PyPI.
We used commit 176cd934982212a4f75e0669ee81b834ee71dbb0.
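For reproducibility, that commit can also be pinned directly in requirements.txt using pip's VCS syntax (same URL and commit as above), instead of the unpinned line we currently use:

```
git+https://github.com/facebookresearch/fairseq.git@176cd934982212a4f75e0669ee81b834ee71dbb0
```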
We use the following minimal example:
#!/bin/bash
train_dir="/datasets/text/germancolossal4/debug"
out_dir="data/preprocessed"
echo "Preprocessing data: ${train_dir}"
fairseq-preprocess \
--trainpref ${train_dir}/train \
--validpref ${train_dir}/valid \
--testpref ${train_dir}/test \
--task denoising \
--criterion cross_entropy \
--optimizer adam \
--only-source \
--workers 1 \
--destdir ${out_dir}
echo "Finished preprocessing:"
fairseq-train ${out_dir} \
--task denoising \
--arch bart_base \
--batch-size 1 \
--skip-invalid-size-inputs-valid-test \
--optimizer adam
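As a possible data point for triage, we could shard the corpus and preprocess it piecewise to check whether peak RAM scales with input size. A sketch using coreutils split; the demo below substitutes a small generated file for the real ${train_dir}/train:

```shell
# Sketch: shard a line-based corpus so each fairseq-preprocess run sees less data.
# A small generated file stands in for the real "${train_dir}/train" here.
tmp=$(mktemp -d)
seq 1 1000 > "$tmp/train"            # stand-in corpus
# -C: pack as many complete lines as fit into each <=2KB shard; -d: numeric suffixes
split -C 2K -d "$tmp/train" "$tmp/train.shard."
ls "$tmp"/train.shard.*
# Each shard could then be passed to fairseq-preprocess via --trainpref.
```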
We use only one worker for preprocessing because fairseq-preprocess gets stuck with more than one worker.
When running this script with a 20GB train file, fairseq-train runs out of memory and the container crashes without any error message. After adding a wandb project, we observed that training of the first epoch starts but never completes; the pod runs out of memory before then.
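For reference, this is roughly how one could track the training process's memory from outside; a stdlib-only sketch that reads /proc, so Linux-only (the script name, pid argument, and interval are illustrative):

```python
import os
import sys
import time

def rss_mib(pid: int) -> float:
    """Resident set size of `pid` in MiB, read from /proc/<pid>/status (Linux-only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # the field is reported in kB
    return 0.0

def watch(pid: int, interval: float = 5.0) -> None:
    """Print current and peak RSS of `pid` until the process exits."""
    peak = 0.0
    while os.path.exists(f"/proc/{pid}"):
        cur = rss_mib(pid)
        peak = max(peak, cur)
        print(f"pid {pid}: rss={cur:.0f} MiB, peak={peak:.0f} MiB", flush=True)
        time.sleep(interval)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python rss_watch.py <pid-of-fairseq-train>
    watch(int(sys.argv[1]))
```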
The minimal training command we built follows the suggestions in #1899. This issue might be related to #4930.
Is this amount of RAM usage expected?
Thank you very much in advance!