torchtitan

train llama3 error

Open starstream opened this issue 6 months ago • 2 comments

Root Cause (first observed failure):
[0]:
  time      : 2024-08-05_10:01:43
  host      : iZuf6ct0ygsd4zjh2lit8uZ
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 46669)
  error_file: /tmp/torchelastic_i4d4ivao/none_jzj2c4lc/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
      return f(*args, **kwargs)
    File "/ncluster/dushuai/torchtitan/train.py", line 207, in main
      tokenizer = create_tokenizer(tokenizer_type, job_config.model.tokenizer_path)
    File "/ncluster/dushuai/torchtitan/torchtitan/datasets/tokenizer/__init__.py", line 19, in create_tokenizer
      return TikTokenizer(tokenizer_path)
    File "/ncluster/dushuai/torchtitan/torchtitan/datasets/tokenizer/tiktoken.py", line 52, in __init__
      mergeable_ranks = load_tiktoken_bpe(model_path)
    File "/usr/local/lib/python3.10/dist-packages/tiktoken/load.py", line 148, in load_tiktoken_bpe
      return {
    File "/usr/local/lib/python3.10/dist-packages/tiktoken/load.py", line 149, in <dictcomp>
      base64.b64decode(token): int(rank)
  ValueError: invalid literal for int() with base 10: b'coding=utf-8'
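Reading the traceback: load_tiktoken_bpe splits each line of the tokenizer file into a base64 token and an integer rank, and here the "rank" field was b'coding=utf-8'. That looks like the tail of a "# coding=utf-8" header line, which would mean the file at job_config.model.tokenizer_path is not a tiktoken-format BPE file. This is an inference from the traceback, not something confirmed in the report. A minimal sketch for checking the file before training, assuming every non-empty line should have the form "<base64 token> <integer rank>" (find_bad_bpe_line is a hypothetical helper, not part of torchtitan or tiktoken):

```python
import base64
import binascii


def find_bad_bpe_line(path):
    """Return (line_number, raw_line) for the first line that is not
    '<base64 token> <integer rank>', or None if every line looks valid.

    Hypothetical diagnostic helper; stricter than tiktoken's own loader
    because it validates the base64 alphabet as well as the rank.
    """
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.strip()
            if not line:
                continue  # blank lines are skipped
            parts = line.split()
            if len(parts) != 2:
                return lineno, line
            token, rank = parts
            try:
                base64.b64decode(token, validate=True)
                int(rank)
            except (binascii.Error, ValueError):
                return lineno, line
    return None
```

Calling find_bad_bpe_line on the configured tokenizer path and printing the result shows the first offending line, if any. A file that begins with "# coding=utf-8" would be reported at line 1.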

starstream, Aug 05 '24 10:08