how to identify the process of training?

Open joydchh opened this issue 1 year ago • 5 comments

I started a training process with 4*V100S (32GB VRAM each) at 18:00 and got a "training starts..." prompt. With nvidia-smi, I can see that 3 GPUs are running at 100% utilization. The next morning the processes were still running, but there was nothing in the output folder and no new log messages. Is there some way to see how the training job is going?

joydchh avatar Mar 24 '23 03:03 joydchh

@LorrinWWW, any thoughts?

csris avatar Mar 24 '23 03:03 csris

There should be log messages during training. I suspect rank 0 went down, so the other three were waiting for it. Can you post the full log? @joydchh
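One way to check whether a rank has died is to search the captured output for error markers. Below is a minimal sketch, assuming the training output was redirected to a file. The marker strings and the `find_error_lines` helper are illustrative assumptions, not part of OpenChatKit.

```python
# Hypothetical helper: scan a captured training log for signs of a crash.
# The marker strings are assumptions; adjust them to match your own logs.
ERROR_MARKERS = (
    "Traceback (most recent call last)",
    "Failed to read file",
)


def find_error_lines(log_path, markers=ERROR_MARKERS):
    """Return (line_number, text) pairs for log lines containing any marker."""
    hits = []
    with open(log_path, "r", encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if any(marker in line for marker in markers):
                hits.append((lineno, line.rstrip("\n")))
    return hits
```

For example, `find_error_lines("nohup.out")` would list the matching lines if the job was started with `nohup` and its default output file. An empty list only means none of the markers were found, not that training is healthy.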

LorrinWWW avatar Mar 24 '23 04:03 LorrinWWW

Here is the full log, including some exceptions, not sure if it's really doing the training job.

nohup: ignoring input
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
comm init done!!
token vocab size: 50432
data_utils: parse task_list
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_ni.jsonl 0.2
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_p3.jsonl 0.5
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_flan.jsonl 0.2
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_chip2.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_rallio_safety_and_prosocial.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_soda_dialog.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_unifiedskg_instructions.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_merged_code_xp3.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_oscar_en_sample_dialog.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_ul2_plus_oscar_en_sample_dialog.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_multi_news.jsonl 0.05
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_openai_summarize_tldr.jsonl 0.05
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_squad_v2.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_nq.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_poetry_instructions.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_sqlv2.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_unnatural_instructions.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_conv_finqa.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_essays.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_plot_screenplay_books_dialog.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_grade_school_math_instructions.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_mathqa_flanv2_kojma_cot.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_joke_explanations.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_cuad.jsonl 0.01
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_abstract_infill.jsonl 0.1
data_utils: /data/OpenChatKit/training/../data/OIG/files/unified_image_prompts_instructions.jsonl 0.01
data_utils: get train_data_loader
Running gpipe without data parallel.
=======Initialize Gpipe.
=======Gpipe use FP16
=======Gradient accumulate step: 1
=======Current micro-batch send/recv size: 24 MB (fp16)
=======Number of micro-batches: 64.
loading embs
loading layer 0
loading layer 1
loading layer 2
loading layer 3
loading layer 4
loading layer 5

using Adam fp16 uses DynamicGradScaler.
no checkpoint available, skipping
training starts......
Failed to read file '/data/OpenChatKit/data/OIG/files/unified_oscar_en_sample_dialog.jsonl' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Traceback (most recent call last):
  File "/data/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/data/OpenChatKit/training/dist_clm_train.py", line 332, in main
    train_loop(args, pipe, device, train_data_loader, test_data_loader)
  File "/data/OpenChatKit/training/dist_clm_train.py", line 100, in train_loop
    for i, data in enumerate(train_data_loader):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
pyarrow.lib.ArrowInvalid: Caught ArrowInvalid in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 152, in _generate_tables
    dataset = json.load(f)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/data/OpenChatKit/training/tasks/data_loaders/data_utils.py", line 253, in get_sequence
    inputs = next(it)
  File "/data/OpenChatKit/training/tasks/data_loaders/data_utils.py", line 195, in get_sequence
    for x in self.data:
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 934, in __iter__
    yield from self._iter_pytorch(ex_iterable)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 867, in _iter_pytorch
    for key, example in ex_iterable.shard_data_sources(shards_indices):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 627, in __iter__
    for x in self.ex_iterable:
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 113, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 763, in wrapper
    for key, table in generate_tables_fn(**kwargs):
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 155, in _generate_tables
    raise e
  File "/home/ubuntu/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 131, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

Here are the GPU states: [image]

joydchh avatar Mar 24 '23 05:03 joydchh

@joydchh Rank 0 crashed when it tried to read the dataset. Can you check whether all the data files are present and complete in "/data/OpenChatKit/training/../data/OIG/files/"?
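A quick way to run that check is to try parsing the first few records of each `.jsonl` file. This is a minimal sketch, not part of OpenChatKit: the helper names `check_jsonl_file` and `check_data_dir` are hypothetical, and it only inspects the start of each file, which matches the "in row 0" error in the log above but would not catch a file truncated near its end.

```python
import json
from pathlib import Path


def check_jsonl_file(path, max_lines=5):
    """Return None if the first non-empty lines parse as JSON, else a short problem string."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    checked = 0
    with path.open("r", encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                return f"line {lineno}: {e.msg}"
            checked += 1
            if checked >= max_lines:
                break
    return None


def check_data_dir(data_dir, pattern="*.jsonl"):
    """Map each problematic file name to its problem; an empty dict means none were found."""
    problems = {}
    for p in sorted(Path(data_dir).glob(pattern)):
        err = check_jsonl_file(p)
        if err is not None:
            problems[p.name] = err
    return problems
```

Calling `check_data_dir("/data/OpenChatKit/data/OIG/files")` would report files that are empty or do not start with valid JSON. Files that were never downloaded do not show up in the directory listing at all, so compare the listing against the `data_utils:` paths printed in the log as well.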

I also noticed you have 4*V100S (32GB). They might not be able to fine-tune a 20B model, but you can try a smaller model, e.g. EleutherAI/pythia-1.4b-deduped.
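A back-of-the-envelope estimate illustrates the concern. The sketch below assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights, momentum, and variance). It ignores activations and communication buffers, and OpenChatKit's actual memory use may differ. The function names are hypothetical.

```python
def training_state_gb(num_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam state, in GB (1 GB = 1e9 bytes).

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights)
    + 4 (momentum) + 4 (variance). Activations are not included.
    """
    return num_params * bytes_per_param / 1e9


def fits(num_params, num_gpus, gpu_mem_gb, bytes_per_param=16):
    """True if the estimated training state fits in the combined GPU memory."""
    total_gb = num_gpus * gpu_mem_gb
    return training_state_gb(num_params, bytes_per_param) <= total_gb
```

Under these assumptions, a 20B-parameter model needs roughly 320 GB for model and optimizer state alone, against 128 GB across 4 x 32 GB cards, while a 1.4B-parameter model needs roughly 22 GB. The same function can be called with other GPU counts to compare configurations.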

LorrinWWW avatar Mar 24 '23 05:03 LorrinWWW

OK, thanks. It seems the files in OIG/ were not downloaded completely. Do you have a suggested configuration for training? What if I switched to 8*V100S (32GB)? Would that be enough?

joydchh avatar Mar 24 '23 07:03 joydchh