
UnboundLocalError: local variable 'state' referenced before assignment

liguodongiot opened this issue 2 years ago · 1 comment

When I run this demo, the following error occurs:

INFO:__main__:***** Running training *****
INFO:__main__:  Num examples = 117750
INFO:__main__:  Num Epochs = 8
INFO:__main__:  Batch size per device (w. accumulation) = 20
INFO:__main__:  Global train batch size (w. parallel & distributed) = 80
INFO:__main__:  Total optimization steps = 11768
Initial compilation. This might take some minutes...
Epoch ... :   0%|                                                                                                               | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/guodong.li/code/alpa/examples/opt_finetune/run_clm_flax.py", line 1219, in <module>
    main()
  File "/home/guodong.li/code/alpa/examples/opt_finetune/run_clm_flax.py", line 1085, in main
    state, train_metric = p_train_step(state, batch)
UnboundLocalError: local variable 'state' referenced before assignment
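For context, this class of error arises when a variable is assigned only inside conditional branches, and none of those branches runs before the variable is used. A minimal, generic sketch (hypothetical names; this is not the actual branching logic in `run_clm_flax.py`, where `state` is presumably built inside a branch that depends on the parallelism configuration):

```python
def build_state(method):
    """Mimic a variable that is only bound inside known branches."""
    if method == "shard_parallel":
        state = {"method": method}
    elif method == "pipeshard_parallel":
        state = {"method": method}
    # If neither branch ran, 'state' was never bound in this scope.
    try:
        return state
    except UnboundLocalError:
        return None  # the real script would crash here instead

print(build_state("shard_parallel"))  # {'method': 'shard_parallel'}
print(build_state("unknown"))         # None
```

If the real script follows this pattern, the fix is usually to ensure the configuration actually selects one of the supported branches, or to assign a default before the conditional.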

env:

  • Python 3.9
  • CUDA 11.3

command:

PYTHONPATH=/home/guodong.li/code/alpa/ python3 run_clm_flax.py \
    --output_dir="/home/guodong.li/data/train_result/output_opt" \
    --model_name_or_path="facebook/opt-2.7b" \
    --dataset_name="wikitext" \
    --dataset_config_name="wikitext-2-raw-v1" \
    --do_train --do_eval \
    --block_size="1024" \
    --per_device_train_batch_size="20" \
    --per_device_eval_batch_size="20" \
    --num_micro_batches 4 \
    --operator_parallel 2 \
    --pipeline_parallel 1 \
    --dtype="float16" \
    --learning_rate="5e-4" --warmup_steps="2000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="8" \
    --logging_steps="16" \
    --save_steps="2500" \
    --eval_steps="2500"

dependency:

Package                      Version
---------------------------- -----------------------
absl-py                      1.4.0
aiohttp                      3.8.4
aiosignal                    1.3.1
alpa                         0.2.2
astunparse                   1.6.3
async-timeout                4.0.2
attrs                        22.2.0
cached-property              1.5.2
cachetools                   5.3.0
certifi                      2022.12.7
charset-normalizer           3.0.1
chex                         0.1.5
click                        8.0.4
colorama                     0.4.6
contourpy                    1.0.7
cupy-cuda11x                 11.5.0
cycler                       0.11.0
datasets                     2.9.0
dill                         0.3.6
distlib                      0.3.6
dm-tree                      0.1.8
etils                        1.0.0
evaluate                     0.4.0
fastrlock                    0.8.1
filelock                     3.9.0
flatbuffers                  2.0.7
flax                         0.6.2
fonttools                    4.38.0
frozenlist                   1.3.3
fsspec                       2023.1.0
gast                         0.4.0
google-auth                  2.16.1
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.48.2
h5py                         3.8.0
huggingface-hub              0.12.1
idna                         3.4
importlib-metadata           6.0.0
importlib-resources          5.12.0
jax                          0.3.22
jaxlib                       0.3.22+cuda113.cudnn820
joblib                       1.2.0
jsonschema                   4.17.3
keras                        2.7.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.4
libclang                     15.0.6.1
llvmlite                     0.39.1
lxml                         4.9.2
Markdown                     3.4.1
markdown-it-py               2.1.0
MarkupSafe                   2.1.2
matplotlib                   3.7.0
mdurl                        0.1.2
msgpack                      1.0.4
multidict                    6.0.4
multiprocess                 0.70.14
numba                        0.56.4
numpy                        1.23.5
oauthlib                     3.2.2
opt-einsum                   3.3.0
optax                        0.1.4
orbax                        0.1.2
packaging                    23.0
pandas                       1.5.3
Pillow                       9.4.0
pip                          22.3.1
platformdirs                 3.0.0
portalocker                  2.7.0
protobuf                     3.19.6
PuLP                         2.7.0
pyarrow                      11.0.0
pyasn1                       0.4.8
pyasn1-modules               0.2.8
Pygments                     2.14.0
pyparsing                    3.0.9
pyrsistent                   0.19.3
python-dateutil              2.8.2
pytz                         2022.7.1
PyYAML                       6.0
ray                          2.1.0
redis                        4.5.1
regex                        2022.10.31
requests                     2.28.2
requests-oauthlib            1.3.1
responses                    0.18.0
rich                         13.3.1
rsa                          4.9
sacrebleu                    2.3.1
scikit-learn                 1.2.1
scipy                        1.10.1
setuptools                   65.6.3
six                          1.16.0
tabulate                     0.9.0
tensorboard                  2.12.0
tensorboard-data-server      0.7.0
tensorboard-plugin-wit       1.8.1
tensorflow-estimator         2.7.0
tensorflow-gpu               2.7.0
tensorflow-io-gcs-filesystem 0.30.0
tensorstore                  0.1.32
termcolor                    2.2.0
threadpoolctl                3.1.0
tokenizers                   0.13.2
toolz                        0.12.0
tqdm                         4.64.1
transformers                 4.26.1
typing_extensions            4.5.0
urllib3                      1.26.14
virtualenv                   20.19.0
Werkzeug                     2.2.3
wheel                        0.38.4
wrapt                        1.14.1
xxhash                       3.2.0
yarl                         1.8.2
zipp                         3.14.0

What should I do? Thanks.

liguodongiot · Feb 24 '23

When I use pipeline parallelism, another error occurs:

2023-02-28 11:24:36,070 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.load_opt_params_worker_func() (pid=12770, ip=10.xx.2.46, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fe178157b20>)
  File "/home/guodong.li/code/alpa/examples/opt_finetune/load_params.py", line 147, in load_opt_params_worker_func
    load_array("decoder.embed_tokens.weight"))
  File "/home/guodong.li/code/alpa/examples/opt_finetune/load_params.py", line 121, in load_array
    return np.load(os.path.join(path, key))
  File "/home/guodong.li/virtual-venv/alpa-venv-py39/lib/python3.9/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/data/nfs/guodong.li/pretrain/opt-2.7b/decoder.embed_tokens.weight'
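From this traceback, `load_array` resolves each parameter name (e.g. `decoder.embed_tokens.weight`) to a file on disk and hands the path to `np.load`, so the error means the converted per-tensor weight files are absent at that location. A defensive sketch of that lookup (a hypothetical helper, not the real `load_params.py`; the real code returns `np.load(full)` on the resolved path):

```python
import os
import tempfile

def load_array(path, key):
    """Resolve one flat weight file, failing early with a clear message."""
    full = os.path.join(path, key)
    if not os.path.exists(full):
        raise FileNotFoundError(
            f"Missing weight file '{key}' under '{path}'; the checkpoint "
            f"must be converted to the flat per-tensor format first.")
    return full  # the real loader would return np.load(full) here

# Demo with a temporary directory standing in for the checkpoint path.
with tempfile.TemporaryDirectory() as ckpt:
    name = "decoder.embed_tokens.weight"
    open(os.path.join(ckpt, name), "wb").close()  # stand-in weight file
    print(os.path.basename(load_array(ckpt, name)))  # decoder.embed_tokens.weight
```

Checking existence before loading turns a deep `np.load` failure inside a Ray worker into an immediate, explicit error naming the missing key.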

command:

PYTHONPATH=/home/guodong.li/code/alpa/ python3 run_clm_flax.py \
    --output_dir="/home/guodong.li/data/train_result/output_opt" \
    --config_name="./config_30b.json" \
    --tokenizer_name="facebook/opt-30b" \
    --alpa_init \
    --use_manual_layer \
    --dataset_name="wikitext" \
    --dataset_config_name="wikitext-2-raw-v1" \
    --do_train \
    --block_size="1024" \
    --per_device_train_batch_size="1024" \
    --per_device_eval_batch_size="64" \
    --num_micro_batches 256 \
    --operator_parallel 1 \
    --pipeline_parallel 8 \
    --dtype="float16" \
    --learning_rate="5e-4" --warmup_steps="2000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="10" \
    --logging_steps="1" \
    --save_steps="888" \
    --eval_steps="888"

When I check the checkpoint directory, I find that this file is indeed missing.

(screenshot omitted)

Where should I download it? @zhisbug

Best wishes!

liguodongiot · Feb 26 '23