[Chatllama] Training the actor model with llama-7B: the loss is NaN
I manually split the model checkpoint into 8 splits and trained the llama model on 8 V100 GPUs, but strangely the loss is NaN. I trained successfully with the same data on the gpt2-xl model, so I don't think it is a data problem. Can anybody figure out why?
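For reference, here is a minimal sketch of the kind of sanity check that can rule out a bad shard from the manual split (the glob path is just a placeholder for my local layout, not something chatllama defines):

import glob
import torch

# Placeholder for wherever the manually split llama-7B shards live locally.
SHARD_GLOB = "/home/ubuntu/ubuntu/pyllama_data1/7B/*.pth"

for path in sorted(glob.glob(SHARD_GLOB)):
    state_dict = torch.load(path, map_location="cpu")
    for name, tensor in state_dict.items():
        if not torch.is_tensor(tensor) or not tensor.is_floating_point():
            continue
        # Any NaN/Inf already present in a shard would explain a NaN loss from iteration 1.
        n_nan = torch.isnan(tensor).sum().item()
        n_inf = torch.isinf(tensor).sum().item()
        if n_nan or n_inf:
            print(f"{path} :: {name}: {n_nan} NaN, {n_inf} Inf, dtype={tensor.dtype}")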
ubuntu@ip-172-31-10-190:~/ubuntu$ torchrun --standalone --nnodes=1 --nproc-per-node=8 artifacts/main.py artifacts/config/config.yaml --type ACTOR
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Current device used :cuda
Start cleaning the dataset for Actor
Dataset is already clean
local_rank: 0 world_size: 8
local_rank: 1 world_size: 8
local_rank: 2 world_size: 8
local_rank: 3 world_size: 8
local_rank: 4 world_size: 8
local_rank: 5 world_size: 8
local_rank: 6 world_size: 8
local_rank: 7 world_size: 8
initializing model parallel with size 8
initializing ddp with size 1
initializing pipeline with size 1
Loading
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
[2023-03-31 03:50:04,515] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:06,138] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-03-31 03:50:06,138] [INFO] [logging.py:77:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-03-31 03:50:06,138] [INFO] [logging.py:77:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-03-31 03:50:06,152] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-03-31 03:50:06,152] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-31 03:50:06,152] [INFO] [logging.py:77:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2023-03-31 03:50:06,306] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-03-31 03:50:06,307] [INFO] [utils.py:830:see_memory_usage] MA 14.55 GB Max_MA 14.55 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:06,307] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 21.76 GB, percent = 2.9%
[2023-03-31 03:50:06,308] [INFO] [stage3.py:113:__init__] Reduce bucket size 100
[2023-03-31 03:50:06,309] [INFO] [stage3.py:114:__init__] Prefetch bucket size 0
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.25902438163757324 seconds
Time to load utils op: 0.10285019874572754 seconds
Time to load utils op: 0.20204687118530273 seconds
Time to load utils op: 0.20221972465515137 seconds
Time to load utils op: 0.30254244804382324 seconds
Time to load utils op: 0.3023359775543213 seconds
Time to load utils op: 0.3025219440460205 seconds
Time to load utils op: 0.30247020721435547 seconds
[2023-03-31 03:50:07,142] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-03-31 03:50:07,143] [INFO] [utils.py:830:see_memory_usage] MA 14.55 GB Max_MA 14.55 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:07,143] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 22.86 GB, percent = 3.1%
Parameter Offload: Total persistent parameters: 0 in 0 params
[2023-03-31 03:50:09,626] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-03-31 03:50:09,627] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 14.55 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:09,627] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 38.78 GB, percent = 5.2%
[2023-03-31 03:50:09,708] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-03-31 03:50:09,708] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:09,708] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 38.78 GB, percent = 5.2%
[2023-03-31 03:50:12,637] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 9
[2023-03-31 03:50:12,638] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:12,638] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 57.15 GB, percent = 7.6%
[2023-03-31 03:50:12,756] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-03-31 03:50:12,757] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:12,757] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 58.49 GB, percent = 7.8%
[2023-03-31 03:50:15,870] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-03-31 03:50:15,870] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:15,871] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 80.31 GB, percent = 10.7%
[2023-03-31 03:50:15,985] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-03-31 03:50:15,986] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:15,986] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 81.74 GB, percent = 10.9%
[2023-03-31 03:50:36,904] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-03-31 03:50:36,905] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:36,905] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 167.03 GB, percent = 22.3%
[2023-03-31 03:50:38,351] [INFO] [stage3.py:376:_setup_for_real_optimizer] optimizer state initialized
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007975101470947266 seconds
Training with DeepSpeed
Start Actor Model Pretraining
Looking for checkpoints...
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
[2023-03-31 03:50:41,801] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-03-31 03:50:41,802] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.49 GB CA 14.82 GB Max_CA 15 GB
[2023-03-31 03:50:41,802] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 179.5 GB, percent = 24.0%
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.CosineAnnealingWarmRestarts object at 0x7f812de03340>
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] step=0, skipped=0, lr=[9e-06], mom=[(0.9, 0.999)]
[2023-03-31 03:50:41,803] [INFO] [config.py:1010:print] DeepSpeedEngine configuration:
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] amp_enabled .................. False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] amp_params ................... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] bfloat16_enabled ............. False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] checkpoint_parallel_write_pipeline False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] checkpoint_tag_validation_enabled True
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] checkpoint_tag_validation_fail False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f812de038e0>
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] communication_data_type ...... None
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] compression_config ...........
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] curriculum_enabled_legacy .... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] curriculum_params_legacy ..... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] data_efficiency_enabled ...... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] dataloader_drop_last ......... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] disable_allgather ............ False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] dump_state ................... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_enabled ........... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_gas_boundary_resolution 1 [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_layer_num ......... 0 [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_max_iter .......... 100 [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_stability ......... 1e-06 [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_tol ............... 0.01 [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_verbose ........... False [2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] elasticity_enabled ........... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] fp16_auto_cast ............... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] fp16_enabled ................. 
True [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] fp16_master_weights_and_gradients False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] global_rank .................. 0 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] grad_accum_dtype ............. None [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] gradient_accumulation_steps .. 1 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] gradient_clipping ............ 0.0 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] gradient_predivide_factor .... 1.0 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] initial_dynamic_scale ........ 4294967296 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] load_universal_checkpoint .... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] loss_scale ................... 0 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] memory_breakdown ............. False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] optimizer_legacy_fusion ...... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] optimizer_name ............... adam [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] optimizer_params ............. {'lr': 0.00015} [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] pld_enabled .................. False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] pld_params ................... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] prescale_gradients ........... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] scheduler_name ............... None [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] scheduler_params ............. None [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] sparse_attention ............. None [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] sparse_gradients_enabled ..... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] steps_per_print .............. 10 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] train_batch_size ............. 8 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] train_micro_batch_size_per_gpu 1 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] use_node_local_storage ....... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] wall_clock_breakdown ......... False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] world_size ................... 8 [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_allow_untested_optimizer False [2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=0 param_persistence_threshold=100 model_persistence_threshold=sys.maxsize max_live_parameters=0 max_reuse_distance=0 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_enabled ................. True
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_optimization_stage ...... 3
[2023-03-31 03:50:41,806] [INFO] [config.py:999:print_user_config] json = {
    "gradient_accumulation_steps": 1,
    "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } },
    "zero_force_ds_cpu_optimizer": false,
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_prefetch_bucket_size": 0,
        "stage3_param_persistence_threshold": 100,
        "reduce_bucket_size": 100,
        "sub_group_size": 1.000000e+08,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false
}
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
[2023-03-31 03:50:48,116] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
[2023-03-31 03:50:51,973] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
[2023-03-31 03:50:55,374] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
[2023-03-31 03:50:58,870] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
[2023-03-31 03:51:02,331] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
[2023-03-31 03:51:05,827] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
[2023-03-31 03:51:09,498] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
Traceback (most recent call last):
File "artifacts/main.py", line 61, in