OpenChatKit
How can I check the progress of training?
I started a training process on 4*V100S (32GB VRAM each) at 18:00 and got a "training starts..." message. With nvidia-smi, I can see 3 GPUs running at 100% utilization. The next morning the processes were still running, but there was nothing in the output folder and no log messages. Is there some way to see how the training job is going?
@LorrinWWW, any thoughts?
There should be log messages during training. It looks like rank 0 went down, so the other three were waiting for it. Can you post the full log? @joydchh
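When one rank crashes while the others keep spinning at 100% utilization, the crash is usually visible in the captured output. The following is a minimal sketch for scanning a saved log for crash markers. It assumes the training output was redirected to a text file; the function name `find_error_lines` and the chosen markers are illustrative and not part of OpenChatKit.

```python
import re

# Markers that commonly indicate a crash in Python training output.
# This list is an assumption; extend it for your own logs.
ERROR_PATTERN = re.compile(r"Traceback \(most recent call last\)|Error\b|Failed to")


def find_error_lines(log_text):
    """Return (line_number, line) pairs for lines that look like errors."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

To use it, read the log file into a string (for example with `pathlib.Path(...).read_text()`) and print the returned pairs. An empty result does not prove the job is healthy; it only means none of the markers appeared.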
Here is the full log. It includes some exceptions, so I'm not sure whether it is actually training.
nohup: ignoring input
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
comm init done!!
token vocab size: 50432
data_utils: parse task_list
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_ni.jsonl 0.2
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_p3.jsonl 0.5
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_flan.jsonl 0.2
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_chip2.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_rallio_safety_and_prosocial.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_soda_dialog.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_unifiedskg_instructions.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_merged_code_xp3.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_oscar_en_sample_dialog.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_ul2_plus_oscar_en_sample_dialog.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_multi_news.jsonl 0.05
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_openai_summarize_tldr.jsonl 0.05
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_squad_v2.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_nq.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_poetry_instructions.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_sqlv2.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_unnatural_instructions.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_conv_finqa.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_essays.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_plot_screenplay_books_dialog.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_grade_school_math_instructions.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_mathqa_flanv2_kojma_cot.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_joke_explanations.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_cuad.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_abstract_infill.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_image_prompts_instructions.jsonl 0.01
data_utils: get train_data_loader
Running gpipe without data parallel.
=======Initialize Gpipe.
=======Gpipe use FP16
=======Gradient accumulate step: 1
=======Current micro-batch send/recv size: 24 MB (fp16)
=======Number of micro-batches: 64.
loading embs
loading layer 0
loading layer 1
loading layer 2
loading layer 3
loading layer 4
loading layer 5
using Adam
fp16 uses DynamicGradScaler.
no checkpoint available, skipping
training starts......
Failed to read file '/data/OpenChatKit/data/OIG/files/unified_oscar_en_sample_dialog.jsonl' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Traceback (most recent call last):
  File "/data/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/data/OpenChatKit/training/dist_clm_train.py", line 332, in main
    train_loop(args, pipe, device, train_data_loader, test_data_loader)
  File "/data/OpenChatKit/training/dist_clm_train.py", line 100, in train_loop
    for i, data in enumerate(train_data_loader):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
pyarrow.lib.ArrowInvalid: Caught ArrowInvalid in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 152, in _generate_tables
    dataset = json.load(f)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/data/OpenChatKit/training/tasks/data_loaders/data_utils.py", line 253, in get_sequence
    inputs = next(it)
  File "/data/OpenChatKit/training/tasks/data_loaders/data_utils.py", line 195, in get_sequence
    for x in self.data:
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 934, in __iter__
    yield from self._iter_pytorch(ex_iterable)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 867, in _iter_pytorch
    for key, example in ex_iterable.shard_data_sources(shards_indices):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 627, in __iter__
    for x in self.ex_iterable:
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 113, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 763, in wrapper
    for key, table in generate_tables_fn(**kwargs):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 155, in _generate_tables
    raise e
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 131, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0
Here are the GPU states.
@joydchh Rank 0 crashed when it tried to read the dataset. Can you check whether all data files are fully prepared in "/data/OpenChatKit/training/../data/OIG/files/"?
Also, I noticed you have 4*V100S (32GB). They might not be able to fine-tune a 20B model, but you can try smaller models, e.g. EleutherAI/pythia-1.4b-deduped.
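The traceback above ends in "Expecting value: line 1 column 1", which is what a JSON parser reports when the first line of a file is not JSON at all, for example because a download was cut short or left a placeholder behind. Below is a minimal sketch for checking a directory of `.jsonl` files before launching training. The helper names `first_bad_line` and `check_dataset_dir` are hypothetical and not part of OpenChatKit, and the sketch only parses the first lines of each file, so it will not catch corruption deeper in a file.

```python
import json
from pathlib import Path
from typing import Optional


def first_bad_line(path, max_lines=100) -> Optional[str]:
    """Describe the first problem found in a JSONL file, or return None.

    Only the first `max_lines` lines are parsed to keep the check fast.
    """
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if number > max_lines:
                break
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                return f"line {number}: {exc.msg}"
    return None


def check_dataset_dir(directory, pattern="*.jsonl"):
    """Map each problematic file name to its problem; healthy files are omitted."""
    problems = {}
    for path in sorted(Path(directory).glob(pattern)):
        issue = first_bad_line(path)
        if issue is not None:
            problems[path.name] = issue
    return problems
```

Calling `check_dataset_dir` on the data directory from the log returns an empty dict when every file that is present parses. Files that are absent entirely are not reported by the directory scan, so compare the file names against the task list printed by `data_utils` as well.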
OK, thanks. It seems the files in OIG/ were not downloaded completely. Do you have a suggested configuration for training? What if I switched to 8*V100S (32GB)? Would that be enough?
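For a rough sense of scale, here is a back-of-the-envelope estimate, not a measured figure and not an answer from the maintainers. It assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights, momentum, and variance), ignores activations and communication buffers entirely, and treats the model sizes as round numbers.

```python
def training_memory_gb(num_params, bytes_per_param=16):
    """Approximate memory for weights, gradients, and Adam state in GB.

    16 bytes/param is a rule of thumb for mixed-precision Adam and is an
    assumption here; activations and buffers are not included.
    """
    return num_params * bytes_per_param / 1e9


GPU_MEMORY_GB = 32  # V100S, as stated in the thread

for label, params in (("20B", 20e9), ("1.4B", 1.4e9)):
    needed = training_memory_gb(params)
    for gpus in (4, 8):
        available = gpus * GPU_MEMORY_GB
        verdict = "fits" if needed < available else "does not fit"
        print(f"{label} model: ~{needed:.1f} GB needed, "
              f"{gpus} GPUs provide {available} GB: {verdict} (before activations)")
```

Under these assumptions, the model state alone for a 20B model exceeds the combined memory of both 4 and 8 such GPUs, while a 1.4B model leaves room for activations. Actual requirements depend on the optimizer implementation, sequence length, and batch size.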