
I tried to run it, but after starting the pretraining task the process gets killed. Can you help?

usccolumbia opened this issue 3 years ago

(proteinlm)xxxx@quant:~/ProteinLM/pretrain$ sh examples/pretrain_tape.sh
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
WARNING: overriding default arguments for tokenizer_type:BertWordPieceLowerCase with tokenizer_type:BertWordPieceCase
------------------------ arguments ------------------------
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_load ....................................... None
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
block_data_path ................................. None
checkpoint_activations .......................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_size .............................. 1
data_path ....................................... ['my-tape_text_sentence']
DDP_impl ........................................ local
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
exit_duration_in_mins ........................... None
exit_interval ................................... None
faiss_use_gpu ................................... False
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_allreduce .................................. False
fp32_residual_connection ........................ False
global_batch_size ............................... 8
hidden_dropout .................................. 0.1
hidden_size ..................................... 768
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
initial_loss_scale .............................. 4294967296
layernorm_epsilon ............................... 1e-12
lazy_mpu_init ................................... None
load ............................................ ./checkopoint
local_rank ...................................... None
log_interval .................................... 100
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0001
lr_decay_iters .................................. 990000
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
max_position_embeddings ......................... 2176
merge_file ...................................... None
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... False
no_load_rng ..................................... False
no_save_optim ................................... False
no_save_rng ..................................... False
num_attention_heads ............................. 12
num_layers ...................................... 12
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
pipeline_model_parallel_size .................... 1
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
report_topk_accuracies .......................... []
reset_attention_mask ............................ False
reset_position_ids .............................. False
save ............................................ ./checkopoint
save_interval ................................... 10000
scaled_masked_softmax_fusion .................... True
scaled_upper_triang_masked_softmax_fusion ....... None
seed ............................................ 1234
seq_length ...................................... 2176
short_seq_prob .................................. 0.1
split ........................................... 32593668,1715454,44311
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
titles_data_path ................................ None
tokenizer_type .................................. BertWordPieceCase
train_iters ..................................... 2000000
train_samples ................................... None
use_checkpoint_lr_scheduler ..................... False
use_cpu_initialization .......................... False
use_one_sent_docs ............................... False
vocab_file ...................................... ./protein_tools/iupac_vocab.txt
weight_decay .................................... 0.01
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2
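Several of the derived numbers in this log (the micro-batch count above, and the padded vocabulary and dataset target sizes reported further down) follow directly from the arguments. This is a minimal sketch of my reading of how Megatron-LM combines these flags, not code taken from ProteinLM itself:

```python
# Minimal sketch (assumption: this mirrors how Megatron-LM derives these
# values from the arguments above; it is not code from ProteinLM).

def num_micro_batches(global_batch_size, micro_batch_size, data_parallel_size):
    # The global batch is split across data-parallel ranks, each of which
    # accumulates gradients over several micro-batches.
    return global_batch_size // (micro_batch_size * data_parallel_size)

def padded_vocab_size(orig_vocab_size, divisible_by):
    # The tokenizer vocabulary is padded up to the next multiple of
    # make_vocab_size_divisible_by (times tensor_model_parallel_size, here 1).
    multiple = -(-orig_vocab_size // divisible_by)  # ceiling division
    return multiple * divisible_by

print(num_micro_batches(8, 4, 1))   # 2   -> "setting number of micro-batches to constant 2"
print(padded_vocab_size(31, 128))   # 128 -> 31 real tokens + 97 dummy tokens

# Dataset target sizes printed later in the log:
train_iters, global_batch, eval_interval, eval_iters = 2000000, 8, 1000, 10
print(train_iters * global_batch)                                      # 16000000 train samples
print((train_iters // eval_interval + 1) * eval_iters * global_batch)  # 160080 validation samples
print(eval_iters * global_batch)                                       # 80 test samples
```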

building BertWordPieceCase tokenizer ...
padded vocab (size: 31) with 97 dummy tokens (new size: 128)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
time to initialize megatron (seconds): 74.673
[after megatron is initialized] datetime: 2022-02-09 00:02:02
building TAPE model ...
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 87417728
learning rate decay style: linear
WARNING: could not find the metadata file ./checkopoint/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
time (ms) | load checkpoint: 10.21
[after model, optimizer, and learning rate scheduler are built] datetime: 2022-02-09 00:02:02
building train, validation, and test datasets ...
datasets target sizes (minimum size):
  train:      16000000
  validation: 160080
  test:       80
building train, validation, and test datasets for TAPE ...
building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
finished creating indexed dataset in 0.013824 seconds
number of documents: 32593668
dataset split:
  train: document indices in [0, 30924048) total of 30924048 documents
  validation: document indices in [30924048, 32551627) total of 1627579 documents
  test: document indices in [32551627, 32593668) total of 42041 documents
WARNING: could not find index map files, building the indices on rank 0 ...
last epoch number of samples (26365) is larger than 80% of number of samples per epoch (28422), setting separate_last_epoch to False
Killed
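The run ends with a bare "Killed" (no Python traceback) right after rank 0 starts building the index map files for the ~32.6M-document corpus, which usually indicates the Linux out-of-memory killer stopped the process while those index arrays were being built in host RAM. For reference, the document split boundaries reported just above can be reproduced from the --split weights; the rounding and final adjustment below are my reading of the Megatron-LM split logic, not necessarily ProteinLM's exact code:

```python
# Reproduce the document split boundaries printed above from the --split
# weights (32593668,1715454,44311).  Assumption: the weights are normalized,
# each share is rounded, and the boundaries are then shifted so the last one
# lands exactly on the total document count.

weights = [32593668.0, 1715454.0, 44311.0]
num_documents = 32593668

total = sum(weights)
boundaries = [0]
for w in weights:
    boundaries.append(boundaries[-1] + int(round(w / total * num_documents)))

# Nudge every boundary by the accumulated rounding error so the split
# covers all documents.
diff = boundaries[-1] - num_documents
for i in range(1, len(boundaries)):
    boundaries[i] -= diff

print(boundaries)
# [0, 30924048, 32551627, 32593668] -> train [0, 30924048),
# validation [30924048, 32551627), test [32551627, 32593668)
```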

usccolumbia · Feb 09 '22 05:02