UnboundLocalError: local variable 'state' referenced before assignment
When I run this demo, an error occurs:
INFO:__main__:***** Running training *****
INFO:__main__: Num examples = 117750
INFO:__main__: Num Epochs = 8
INFO:__main__: Batch size per device (w. accumulation) = 20
INFO:__main__: Global train batch size (w. parallel & distributed) = 80
INFO:__main__: Total optimization steps = 11768
Initial compilation. This might take some minutes...
Epoch ... : 0%| | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/guodong.li/code/alpa/examples/opt_finetune/run_clm_flax.py", line 1219, in <module>
main()
File "/home/guodong.li/code/alpa/examples/opt_finetune/run_clm_flax.py", line 1085, in main
state, train_metric = p_train_step(state, batch)
UnboundLocalError: local variable 'state' referenced before assignment
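For context, the failure mode is the usual Python scoping pitfall: `state` is presumably only assigned inside one branch of `main()` (e.g. a model-loading path that my arguments skip), so the later `p_train_step(state, batch)` call sees an unbound local. The sketch below is a hypothetical simplification, not the actual alpa source; the names and the condition are made up.

```python
# Minimal reproduction of the suspected failure pattern in main():
# `state` is bound only inside a conditional branch, so using it
# afterwards raises UnboundLocalError when the branch is skipped.

def main(use_alpa_init: bool) -> str:
    if use_alpa_init:          # hypothetical stand-in for the real condition
        state = "train_state"  # only binding site for `state`
    # ... many lines later, the training loop uses `state` unconditionally ...
    try:
        return state  # UnboundLocalError if the branch above did not run
    except UnboundLocalError as exc:
        return f"error: {exc}"

print(main(True))   # prints "train_state"
print(main(False))  # prints an UnboundLocalError message
```

If this is the cause, the fix on the script side is to make sure every argument combination reaches a branch that builds the train state (or to fail early with a clear message when it does not).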
env:
- python3.9
- cuda11.3
command:
PYTHONPATH=/home/guodong.li/code/alpa/ python3 run_clm_flax.py \
--output_dir="/home/guodong.li/data/train_result/output_opt" \
--model_name_or_path="facebook/opt-2.7b" \
--dataset_name="wikitext" \
--dataset_config_name="wikitext-2-raw-v1" \
--do_train --do_eval \
--block_size="1024" \
--per_device_train_batch_size="20" \
--per_device_eval_batch_size="20" \
--num_micro_batches 4 \
--operator_parallel 2 \
--pipeline_parallel 1 \
--dtype="float16" \
--learning_rate="5e-4" --warmup_steps="2000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="8" \
--logging_steps="16" \
--save_steps="2500" \
--eval_steps="2500"
dependency:
Package Version
---------------------------- -----------------------
absl-py 1.4.0
aiohttp 3.8.4
aiosignal 1.3.1
alpa 0.2.2
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.2.0
cached-property 1.5.2
cachetools 5.3.0
certifi 2022.12.7
charset-normalizer 3.0.1
chex 0.1.5
click 8.0.4
colorama 0.4.6
contourpy 1.0.7
cupy-cuda11x 11.5.0
cycler 0.11.0
datasets 2.9.0
dill 0.3.6
distlib 0.3.6
dm-tree 0.1.8
etils 1.0.0
evaluate 0.4.0
fastrlock 0.8.1
filelock 3.9.0
flatbuffers 2.0.7
flax 0.6.2
fonttools 4.38.0
frozenlist 1.3.3
fsspec 2023.1.0
gast 0.4.0
google-auth 2.16.1
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.48.2
h5py 3.8.0
huggingface-hub 0.12.1
idna 3.4
importlib-metadata 6.0.0
importlib-resources 5.12.0
jax 0.3.22
jaxlib 0.3.22+cuda113.cudnn820
joblib 1.2.0
jsonschema 4.17.3
keras 2.7.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.4
libclang 15.0.6.1
llvmlite 0.39.1
lxml 4.9.2
Markdown 3.4.1
markdown-it-py 2.1.0
MarkupSafe 2.1.2
matplotlib 3.7.0
mdurl 0.1.2
msgpack 1.0.4
multidict 6.0.4
multiprocess 0.70.14
numba 0.56.4
numpy 1.23.5
oauthlib 3.2.2
opt-einsum 3.3.0
optax 0.1.4
orbax 0.1.2
packaging 23.0
pandas 1.5.3
Pillow 9.4.0
pip 22.3.1
platformdirs 3.0.0
portalocker 2.7.0
protobuf 3.19.6
PuLP 2.7.0
pyarrow 11.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
Pygments 2.14.0
pyparsing 3.0.9
pyrsistent 0.19.3
python-dateutil 2.8.2
pytz 2022.7.1
PyYAML 6.0
ray 2.1.0
redis 4.5.1
regex 2022.10.31
requests 2.28.2
requests-oauthlib 1.3.1
responses 0.18.0
rich 13.3.1
rsa 4.9
sacrebleu 2.3.1
scikit-learn 1.2.1
scipy 1.10.1
setuptools 65.6.3
six 1.16.0
tabulate 0.9.0
tensorboard 2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit 1.8.1
tensorflow-estimator 2.7.0
tensorflow-gpu 2.7.0
tensorflow-io-gcs-filesystem 0.30.0
tensorstore 0.1.32
termcolor 2.2.0
threadpoolctl 3.1.0
tokenizers 0.13.2
toolz 0.12.0
tqdm 4.64.1
transformers 4.26.1
typing_extensions 4.5.0
urllib3 1.26.14
virtualenv 20.19.0
Werkzeug 2.2.3
wheel 0.38.4
wrapt 1.14.1
xxhash 3.2.0
yarl 1.8.2
zipp 3.14.0
What should I do? Thanks.

When I use pipeline parallelism, another error occurs:
2023-02-28 11:24:36,070 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.load_opt_params_worker_func() (pid=12770, ip=10.xx.2.46, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fe178157b20>)
File "/home/guodong.li/code/alpa/examples/opt_finetune/load_params.py", line 147, in load_opt_params_worker_func
load_array("decoder.embed_tokens.weight"))
File "/home/guodong.li/code/alpa/examples/opt_finetune/load_params.py", line 121, in load_array
return np.load(os.path.join(path, key))
File "/home/guodong.li/virtual-venv/alpa-venv-py39/lib/python3.9/site-packages/numpy/lib/npyio.py", line 405, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/data/nfs/guodong.li/pretrain/opt-2.7b/decoder.embed_tokens.weight'
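Judging from the traceback, `load_array` does a plain `np.load` on one file per parameter in a flat directory, so the checkpoint must already have been converted from the HuggingFace format into per-tensor numpy files. The helper below is an illustrative sketch of that layout (the function names and the save/rename dance are my assumptions, not alpa's actual conversion script):

```python
# Sketch of the per-tensor checkpoint layout that load_params.py's
# load_array appears to expect: one raw .npy-format file per parameter,
# named exactly after the key, with no ".npy" suffix.
import os
import numpy as np

def save_numpy_checkpoint(params: dict, path: str) -> None:
    """Dump each tensor to <path>/<key> so np.load(<path>/<key>) works."""
    os.makedirs(path, exist_ok=True)
    for key, value in params.items():
        np.save(os.path.join(path, key), value)  # numpy appends ".npy"
        os.rename(os.path.join(path, key + ".npy"),
                  os.path.join(path, key))       # loader expects no suffix

def load_array(path: str, key: str):
    # Mirrors the load_array call in the traceback.
    return np.load(os.path.join(path, key))
```

So the `FileNotFoundError` likely means the directory in the error message was never populated by a conversion step, rather than a single missing download.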
command:
PYTHONPATH=/home/guodong.li/code/alpa/ python3 run_clm_flax.py \
--output_dir="/home/guodong.li/data/train_result/output_opt" \
--config_name="./config_30b.json" \
--tokenizer_name="facebook/opt-30b" \
--alpa_init \
--use_manual_layer \
--dataset_name="wikitext" \
--dataset_config_name="wikitext-2-raw-v1" \
--do_train \
--block_size="1024" \
--per_device_train_batch_size="1024" \
--per_device_eval_batch_size="64" \
--num_micro_batches 256 \
--operator_parallel 1 \
--pipeline_parallel 8 \
--dtype="float16" \
--learning_rate="5e-4" --warmup_steps="2000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="10" \
--logging_steps="1" \
--save_steps="888" \
--eval_steps="888"
When I check the checkpoint directory, this file is indeed missing.

Where should I download it? @zhisbug

Best wishes!