metaseq
Model_parallel=2 and 2 gpus on FAIR cluster: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking arugment for argument index in method wrapper_index_select)
🐛 Bug
Running with model parallelism (model_parallel=2) and 2 GPUs on the FAIR cluster raises the following exception with the 1.3B_gptz model. UPDATE: with model_parallel=2 and 8 GPUs this works, but it should not fail with 2 GPUs.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_index_select)
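The error itself is PyTorch's generic device-mismatch failure from an embedding lookup. For reference, a minimal standalone snippet (unrelated to metaseq, purely to illustrate the failure mode) reproduces the same message when the index tensor is left on the CPU while the embedding table sits on cuda:1:
import torch
import torch.nn.functional as F

weight = torch.randn(16, 4, device="cuda:1")  # embedding table on GPU 1
indices = torch.tensor([1, 2, 3])             # index tensor defaults to CPU

# RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cpu and cuda:1!
F.embedding(indices, weight)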
There is a warning in the log that might be a clue about the problem -- the full log is at the bottom of the issue:
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
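The sizes in this warning line up with Megatron-style process-group arithmetic: with 2 GPUs and model_parallel_size=2, each data-parallel (FSDP) group holds a single rank. A small sketch of that arithmetic (my reading of the warning, not metaseq code):
# With tensor model parallelism, the data-parallel (FSDP) world shrinks:
world_size = 2           # --distributed-world-size 2
model_parallel_size = 2  # from the 1.3B_gptz_model_parallel config
data_parallel_size = world_size // model_parallel_size
assert data_parallel_size == 1  # the "world size 1" in the warning
# The reduce_scatter process group handed to fairscale still has size 2,
# hence the mismatch that triggers the rollback warning.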
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Log in to the FAIR cluster.
- The environment is set up as follows:
  - apex commit: e1aa1fc1316a84e66869666270941265ec9cde77
  - fairscale commit: 1bc96fa8c69def6d990e42bfbd75f86146ce29bd
  - megatron: --branch fairseq_v2
  - metaseq: git checkout tbmihaylov/gshard-eval-script (rebased from main, with the model below added)
- Model: a fresh copy of the 1.3B_gptz checkpoint from Azure, registered as:
UNIDIR_LM_ROBERTA_DATA = {
# ...
"1.3B_gptz_model_parallel": gptz_sharded_config(
"/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt",
model_parallel_size=2
),
# ...
}
- Slurm allocation
srun --gpus=2 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem 58G --constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash
- Command:
export RUN_MODEL_NAME=1.3B_gptz_model_parallel
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 2
- See the error in the log (at the end of this issue).
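For comparison (per the UPDATE above), the same evaluation completes when 8 GPUs are allocated; presumably only the allocation and the world size change:
srun --gpus=8 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem 58G --constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 8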
Expected behavior
The command should not fail in this configuration (model_parallel=2 on 2 GPUs).
Environment
- Described in the repro steps above.
Additional context
Full error log:
(metaseq_20220328) tbmihaylov@learnfair1844:~/metaseq-internal$ python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 2 | tee debug.log
model_name=1.3B_gptz_model_parallel
args:Namespace(add_bos_token=False, all_gather_list_size=16384, azureml_logging=False, batch_size=None, batch_size_valid=None, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, combine_valid_subsets=None, context_window=0, cpu=False, cpu_offload=False, criterion='cross_entropy', data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=10791, distributed_rank=0, distributed_world_size=2, dont_log_param_and_grad_norm=False, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, future_target=False, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=False, log_file=None, log_format=None, log_interval=100, log_nvidia_smi=False, lr_scheduler='fixed', max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_valid_steps=None, memory_efficient_fp16=True, min_loss_scale=0.0001, model_overrides='{}', model_parallel_size=1, new_profiler=False, no_progress_bar=False, no_reshard_after_forward=False, no_seed_provided=False, num_shards=1, num_workers=1, num_workers_valid=0, optimizer=None, output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, pad_to_fixed_bsz=False, pad_to_fixed_length=False, past_target=False, path=None, plasma_path='/tmp/plasma', profile=False, required_batch_size_multiple=8, results_path=None, sample_break_mode='none', score_sequences=False, seed=1, self_target=False, shard_id=0, shorten_data_split_list='', shorten_method='none', shuffle_docs=False, skip_invalid_size_inputs_valid_test=False, softmax_batch=9223372036854775807, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, train_subset='train', use_plasma_view=False, use_sharded_state=True, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=4000, zero_sharding='none')
model_config:{'model_path': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt', 'extra_args': ['--use-sharded-state', '--memory-efficient-fp16', '--fp16', '--distributed-port', '10791', '--ddp-backend', 'fully_sharded'], 'model_overrides': {'bpe': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True, 'specify_arch': True, 'batch_size': None, 'batch_size_valid': None}, 'model_parallel_size': 2, 'distributed_world_size': 2}
fairseq_cfg.common.model_parallel_size:2
distributed_training.distributed_port=10791
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
[... the warning above is repeated 50 times in total in the original log; the duplicates are omitted here ...]
INFO:fairseq.checkpoint_utils:Done loading state dict
INFO:fairseq.models.fairseq_model:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': '/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 2, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False, 'new_profiler': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 64, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://hpc-pg0-132:18422', 'distributed_port': 18422, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None}, 'dataset': {'_name': None, 'num_workers': 8, 'num_workers_valid': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': True, 'validate_interval': 1, 'validate_interval_updates': 1000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 286102, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'clip_norm_type': 'l2', 'skip_gradient_update_on_clip_norm': False, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0002], 
'stop_min_lr': -1.0, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_best_checkpoints': True, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '-model_part-0', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': True, 's3_upload_path': 'https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', 'model_parallel_size': 2}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 64}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='transformer_lm_megatron', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', 
bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=2048, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7fd829da7a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, 
required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'task': {'_name': 'streaming_language_modeling', 'data': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'end_of_document_symbol': '</s>', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'max_source_positions': None, 'max_target_positions': None, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'data_buffer_size': 10, 'tpu': False, 'update_freq': [1]}, 'criterion': Namespace(_name='vocab_parallel_cross_entropy', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, 
all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function 
load_and_get_model.<locals>.default_post_build_model_hook at 0x7fd829da7a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.95)', 'adam_eps': 1e-08, 'weight_decay': 0.1, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0002], 'block_wise': False}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 357, 'force_anneal': None, 'end_learning_rate': 2e-05, 'zero_lr_warmup_steps': 0, 'power': 1.0, 'total_num_update': 286102.0, 'lr': [0.0002]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 
'bpe_add_prefix_space': True}, 'tokenizer': None, 'simul_type': None}
Loading extension module fused_mix_prec_layer_norm_cuda...
name decoder.embed_tokens.weight parameters Parameter containing:
tensor([[ 0.0014, -0.0082, -0.0032, ..., -0.0111, 0.0054, 0.0015],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0050, 0.0010, 0.0044, ..., 0.0003, -0.0001, -0.0035],
...,
[ 0.0159, 0.0042, 0.0066, ..., 0.0044, 0.0008, -0.0086],
[-0.0008, 0.0032, -0.0032, ..., -0.0060, 0.0036, 0.0086],
[-0.0092, -0.0037, -0.0013, ..., 0.0073, 0.0092, -0.0132]],
requires_grad=True)
name decoder.embed_positions.weight parameters Parameter containing:
tensor([[-7.6732e-03, -5.4649e-03, -4.2956e-03, ..., 7.5325e-03,
7.7163e-03, 1.0300e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[-2.3755e-03, 2.4894e-03, 1.4279e-05, ..., -8.2043e-03,
-1.8271e-02, 3.9899e-03],
...,
[-9.6320e-03, -8.2788e-03, -4.1433e-03, ..., -6.7774e-03,
6.1964e-03, -5.3095e-03],
[-4.4763e-03, 1.4532e-02, -6.0640e-04, ..., 1.5341e-03,
-1.8106e-03, -5.6959e-04],
[ 3.7042e-03, 5.2186e-03, -1.1615e-02, ..., -1.0039e-02,
-8.7586e-04, 7.5653e-03]], requires_grad=True)
name decoder.layers.0._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0059, 0.0019, -0.0075, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.1._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0023, -0.0028, 0.0170, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.2._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0030, -0.0005, 0.0028, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.3._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0077, -0.0097, 0.0007, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.4._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0011, 0.0143, -0.0066, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.5._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025, -0.0069, 0.0071, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.6._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0018, 0.0052, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.7._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0046, -0.0019, -0.0044, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.8._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0011, 0.0047, 0.0105, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.9._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0011, 0.0014, 0.0070, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.10._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0068, 0.0033, -0.0046, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.11._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, 0.0013, 0.0011, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.12._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-2.7278e-03, 7.8808e-03, 6.6479e-05, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00], requires_grad=True)
name decoder.layers.13._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0012, 0.0047, -0.0049, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.14._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0065, 0.0002, 0.0080, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.15._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0017, 0.0030, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.16._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025, 0.0132, -0.0027, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.17._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0027, 0.0103, -0.0090, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.18._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0067, -0.0047, 0.0028, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.19._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0075, 0.0114, -0.0037, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.20._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0069, 0.0069, 0.0075, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.21._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0037, 0.0070, 0.0135, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.22._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0019, 0.0082, -0.0061, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.23._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0134, 0.0073, 0.0100, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layer_norm.weight parameters Parameter containing:
tensor([1., 1., 1., ..., 1., 1., 1.], requires_grad=True)
name decoder.layer_norm.bias parameters Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], requires_grad=True)
Loaded model
model_loading_time=41.0 seconds
model_loading_time_cuda=41.6 seconds
Inferring max tokens for model...
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 893, in <module>
cli_main()
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 56, in cli_main
run_evaluations_from_model_name(**vars(args))
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 320, in run_evaluations_from_model_name
results = load_lm_and_run_func(run_evaluations, model_name, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 178, in load_lm_and_run_func
distributed_utils.call_main(
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 215, in call_main
torch.multiprocessing.spawn(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 199, in distributed_main
main(cfg, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 261, in _load_lm_and_run_func
max_tokens = get_or_infer_max_tokens(model, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 378, in get_or_infer_max_tokens
return infer_max_tokens_before_oom(model)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 416, in infer_max_tokens_before_oom
while not is_max_tokens_oom(candidate_max_tokens):
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 409, in is_max_tokens_oom
raise e
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 405, in is_max_tokens_oom
model.score(input_texts, batch_size=local_bsz, batch_by_size=False)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 198, in score
for hypos in self.generate(
File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 253, in generate
translations = self.task.inference_step(
File "/private/home/tbmihaylov/metaseq/fairseq/tasks/language_modeling_inference_for_models_trained_with_streaming.py", line 387, in inference_step
return generator.generate(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/sequence_scorer.py", line 63, in generate
decoder_out = model(**net_input)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1403, in forward
outputs = self.module(*args, **kwargs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/misc/flatten_params_wrapper.py", line 487, in forward
return self.module(*inputs, **kwinputs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/models/fairseq_model.py", line 373, in forward
return self.decoder(src_tokens, **kwargs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 643, in forward
x, extra = self.extract_features(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 668, in extract_features
return self.extract_features_scriptable(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 706, in extract_features_scriptable
x, tok, pos = self.forward_embedding(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 575, in forward_embedding
positions = self.embed_positions(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/modules/learned_positional_embedding.py", line 53, in forward
return F.embedding(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking arugment for argument index in method wrapper_index_select)
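The traceback pinpoints the positional-embedding lookup in fairseq/modules/learned_positional_embedding.py: the embedding weight is on cuda:1 while the index tensor is still on the CPU. A defensive variant (a sketch under that assumption, not the actual metaseq fix) would move the indices to the weight's device before the lookup:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafeLearnedPositionalEmbedding(nn.Embedding):
    """Sketch only, not the metaseq class: keep the index tensor on the
    same device as the embedding table before calling F.embedding."""

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        positions = positions.to(self.weight.device)  # avoid cpu/cuda mix
        return F.embedding(positions, self.weight, self.padding_idx)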
can you specify --max-tokens?
can you specify --max-tokens?
Specifying --max-tokens does not help -- the same error occurs at inference:
nb_few_shot_samples=0
expected_max_tgt_len=285, max_positions=1024
Average number of train samples: 0.00
Predicting 56 samples with 168 prompts..
Before running model, bs=1, max_tgt_len=284 mem=1.23GB
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 893, in <module>
cli_main()
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 56, in cli_main
run_evaluations_from_model_name(**vars(args))
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 320, in run_evaluations_from_model_name
results = load_lm_and_run_func(run_evaluations, model_name, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 178, in load_lm_and_run_func
distributed_utils.call_main(
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 215, in call_main
torch.multiprocessing.spawn(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 199, in distributed_main
main(cfg, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 272, in _load_lm_and_run_func
return_value = func(model=model, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 478, in run_evaluations
for metric, score in run_evaluation(
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 844, in run_evaluation
eval_predictions, metrics_scores = predictor.predict(eval_samples)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 990, in predict
return self.predict_without_calibration(samples)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 1060, in predict_without_calibration
predictions = self.predict_outputs(samples)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 982, in predict_outputs
return self.score_candidates(samples)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 884, in score_candidates
local_hypotheses = self.model.generate(
File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 253, in generate
translations = self.task.inference_step(
File "/private/home/tbmihaylov/metaseq/fairseq/tasks/language_modeling_inference_for_models_trained_with_streaming.py", line 387, in inference_step
return generator.generate(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/sequence_scorer.py", line 63, in generate
decoder_out = model(**net_input)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1403, in forward
outputs = self.module(*args, **kwargs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/misc/flatten_params_wrapper.py", line 487, in forward
return self.module(*inputs, **kwinputs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/models/fairseq_model.py", line 373, in forward
return self.decoder(src_tokens, **kwargs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 643, in forward
x, extra = self.extract_features(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 668, in extract_features
return self.extract_features_scriptable(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 706, in extract_features_scriptable
x, tok, pos = self.forward_embedding(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 575, in forward_embedding
positions = self.embed_positions(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/modules/learned_positional_embedding.py", line 53, in forward
return F.embedding(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking arugment for argument index in method wrapper_index_select)
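For context, the failure in the last frame is easy to reproduce in isolation. A minimal sketch (illustrative shapes and device index, not values taken from this run):
import torch
import torch.nn.functional as F

# The positional-embedding weight is still on CPU while the position
# indices have already been moved to cuda:1; this is the same mismatch
# as in the traceback above.
weight = torch.randn(2050, 2048)       # learned positional embeddings, left on CPU
positions = torch.arange(284).cuda(1)  # input indices on the second GPU
F.embedding(positions, weight)         # RuntimeError: Expected all tensors to be on the same device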
Can you try running this on 8 GPUs? I have a hunch of what might be wrong...
Model config:
"1.3B_gptz_model_parallel": gptz_sharded_config(
"/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt",
model_parallel_size=8
),
Alloc:
srun --gpus=8 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem-per-gpu 58G \
--constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash
Command:
export RUN_MODEL_NAME=1.3B_gptz_model_parallel
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 \
--max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 8 --max-tokens 1024
Error:
(metaseq_20220328) tbmihaylov@learnfair1855:~/metaseq-internal$ python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 8 --max-tokens 1024
model_name=1.3B_gptz_model_parallel
args:Namespace(add_bos_token=False, all_gather_list_size=16384, azureml_logging=False, batch_size=None, batch_size_valid=None, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, combine_valid_subsets=None, context_window=0, cpu=False, cpu_offload=False, criterion='cross_entropy', data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=11485, distributed_rank=0, distributed_world_size=8, dont_log_param_and_grad_norm=False, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, future_target=False, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=False, log_file=None, log_format=None, log_interval=100, log_nvidia_smi=False, lr_scheduler='fixed', max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_valid_steps=None, memory_efficient_fp16=True, min_loss_scale=0.0001, model_overrides='{}', model_parallel_size=1, new_profiler=False, no_progress_bar=False, no_reshard_after_forward=False, no_seed_provided=False, num_shards=1, num_workers=1, num_workers_valid=0, optimizer=None, output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, pad_to_fixed_bsz=False, pad_to_fixed_length=False, past_target=False, path=None, plasma_path='/tmp/plasma', profile=False, required_batch_size_multiple=8, results_path=None, sample_break_mode='none', score_sequences=False, seed=1, self_target=False, shard_id=0, shorten_data_split_list='', shorten_method='none', shuffle_docs=False, skip_invalid_size_inputs_valid_test=False, softmax_batch=9223372036854775807, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, train_subset='train', use_plasma_view=False, use_sharded_state=True, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=4000, zero_sharding='none')
model_config:{'model_path': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt', 'extra_args': ['--use-sharded-state', '--memory-efficient-fp16', '--fp16', '--distributed-port', '11485', '--ddp-backend', 'fully_sharded'], 'model_overrides': {'bpe': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True, 'specify_arch': True, 'batch_size': None, 'batch_size_valid': None}, 'model_parallel_size': 8, 'distributed_world_size': 8}
fairseq_cfg.common.model_parallel_size:8
distributed_training.distributed_port=11485
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 893, in <module>
cli_main()
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 56, in cli_main
run_evaluations_from_model_name(**vars(args))
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 320, in run_evaluations_from_model_name
results = load_lm_and_run_func(run_evaluations, model_name, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 178, in load_lm_and_run_func
distributed_utils.call_main(
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 215, in call_main
torch.multiprocessing.spawn(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 5 terminated with the following error:
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 199, in distributed_main
main(cfg, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 248, in _load_lm_and_run_func
model = load_and_get_model(fairseq_cfg, config, fsdp=kwargs.get("fsdp", False))
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 232, in load_and_get_model
return get_model(
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 297, in get_model
model = BaseFairseqModel.from_pretrained(
File "/private/home/tbmihaylov/metaseq/fairseq/models/fairseq_model.py", line 247, in from_pretrained
x = hub_utils.from_pretrained(
File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 84, in from_pretrained
models, args, task = checkpoint_utils.load_model_ensemble_and_task(
File "/private/home/tbmihaylov/metaseq/fairseq/checkpoint_utils.py", line 464, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/private/home/tbmihaylov/metaseq/fairseq/checkpoint_utils.py", line 399, in load_checkpoint_to_cpu
paths_to_load = get_paths_to_load(local_path, suffix="shard")
File "/private/home/tbmihaylov/metaseq/fairseq/checkpoint_utils.py", line 339, in get_paths_to_load
if not _is_checkpoint_sharded(checkpoint_files):
File "/private/home/tbmihaylov/metaseq/fairseq/checkpoint_utils.py", line 330, in _is_checkpoint_sharded
size_ratio = max(sizes) / min(sizes)
ValueError: max() arg is an empty sequence
Let's try world size = 8 and MP size = 2. Can you also confirm that model params and inputs are on the same device before the forward pass?
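A quick way to confirm that, as a minimal sketch (model and net_input stand for the FSDP-wrapped model and the batch passed in sequence_scorer.generate above):
import torch

# Collect the devices of every parameter and every tensor input; more
# than one entry in this set means the forward pass will hit the
# device-mismatch RuntimeError.
devices = {p.device for p in model.parameters()}
devices |= {v.device for v in net_input.values() if torch.is_tensor(v)}
print(devices)  # expect a single device, e.g. {device(type='cuda', index=1)}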
BTW, the error you are getting indicates that it does not recognize any checkpoint files matching the given configuration.
get_paths_to_load(local_path, suffix="shard")
should return the list of files to load. I don't think you can switch to MP=8 from an MP=2 checkpoint without preprocessing (re-sharding) it first?
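To illustrate the failure mode (hypothetical file names, loosely following the -model_part-*-shard* suffix visible in the config dump further down): an MP=2 checkpoint only ships model parts 0 and 1, so a loader configured for MP=8 globs for shards that do not exist, gets an empty list, and max(sizes) raises:
import glob
import os

checkpoint_dir = "/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B"
# Model-part ranks 2..7 have no files in an MP=2 checkpoint, so the
# glob below comes back empty for those ranks.
files = glob.glob(os.path.join(checkpoint_dir, "checkpoint_last-model_part-2*.pt"))
sizes = [os.path.getsize(f) for f in files]   # [] when no shards match
size_ratio = max(sizes) / min(sizes)          # ValueError: max() arg is an empty sequence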
The model seems to pass when model_parallel is set to 2 and gpus=8:
Model setting - model_parallel is 2:
"1.3B_gptz_model_parallel": gptz_sharded_config(
"/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt",
model_parallel_size=2
),
Allocation: 8 gpus:
srun --gpus=8 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem-per-gpu 58G --constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash
Command:
export RUN_MODEL_NAME=1.3B_gptz_model_parallel
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks copa --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 8 --max-tokens 1024
Log:
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
(the warning above is repeated many times in the log; duplicates omitted)
name decoder.embed_tokens.weight parameters Parameter containing:
tensor([[ 0.0062, 0.0058, 0.0073, ..., -0.0051, 0.0037, -0.0016],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0048, 0.0071, -0.0014, ..., -0.0118, 0.0014, 0.0099],
...,
[ 0.0005, -0.0041, -0.0026, ..., 0.0007, 0.0005, 0.0060],
[ 0.0007, 0.0065, 0.0017, ..., -0.0017, -0.0107, -0.0055],
[ 0.0030, 0.0023, 0.0041, ..., -0.0042, 0.0066, 0.0032]],
requires_grad=True)
name decoder.embed_positions.weight parameters Parameter containing:
tensor([[-0.0042, 0.0131, 0.0044, ..., -0.0044, 0.0018, 0.0006],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[-0.0087, -0.0003, 0.0008, ..., 0.0026, -0.0109, -0.0003],
...,
[-0.0026, -0.0004, -0.0079, ..., 0.0052, 0.0006, -0.0079],
[-0.0081, -0.0042, -0.0067, ..., 0.0046, 0.0073, -0.0107],
[ 0.0055, 0.0016, 0.0024, ..., 0.0018, -0.0089, 0.0073]],
requires_grad=True)
name decoder.layers.0._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0033, -0.0028, -0.0048, ..., -0.0004, -0.0003, 0.0001],
requires_grad=True)
name decoder.layers.1._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0035, -0.0044, -0.0042, ..., -0.0004, 0.0005, 0.0001],
requires_grad=True)
name decoder.layers.2._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0194, 0.0114, -0.0055, ..., -0.0003, -0.0015, -0.0011],
requires_grad=True)
name decoder.layers.3._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 4.0544e-03, 9.1958e-03, 2.6904e-03, ..., -6.4812e-04,
-1.0233e-03, 5.0022e-06], requires_grad=True)
name decoder.layers.4._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-5.3713e-03, 2.0740e-03, 1.4280e-02, ..., -8.0293e-05,
-4.0192e-04, 2.7878e-04], requires_grad=True)
name decoder.layers.5._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0030, 0.0006, 0.0071, ..., 0.0019, -0.0018, -0.0002],
requires_grad=True)
name decoder.layers.6._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0049, -0.0067, 0.0002, ..., -0.0011, 0.0011, 0.0006],
requires_grad=True)
name decoder.layers.7._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0075, 0.0028, 0.0054, ..., 0.0007, -0.0008, 0.0012],
requires_grad=True)
name decoder.layers.8._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 6.8595e-03, -3.5073e-03, 6.4135e-03, ..., 1.8837e-04,
-8.2689e-05, -3.8628e-04], requires_grad=True)
name decoder.layers.9._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 6.5776e-03, -8.7172e-03, -9.7417e-03, ..., 5.7788e-04,
-3.7473e-05, -1.4916e-04], requires_grad=True)
name decoder.layers.10._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-4.3757e-03, 3.4703e-03, -9.8669e-05, ..., -3.5359e-04,
-7.3773e-04, 4.4415e-04], requires_grad=True)
name decoder.layers.11._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0081, 0.0030, 0.0023, ..., 0.0002, 0.0002, -0.0009],
requires_grad=True)
name decoder.layers.12._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0052, -0.0010, 0.0051, ..., -0.0001, 0.0007, 0.0002],
requires_grad=True)
name decoder.layers.13._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025, -0.0007, -0.0045, ..., 0.0003, -0.0004, -0.0003],
requires_grad=True)
name decoder.layers.14._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0009, 0.0043, 0.0009, ..., 0.0007, -0.0005, 0.0011],
requires_grad=True)
name decoder.layers.15._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0111, -0.0040, 0.0077, ..., 0.0014, 0.0016, -0.0004],
requires_grad=True)
name decoder.layers.16._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-4.3467e-03, -2.2726e-03, 7.2352e-03, ..., 2.1899e-04,
9.2245e-05, 1.3401e-03], requires_grad=True)
name decoder.layers.17._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0024, 0.0011, -0.0050, ..., 0.0015, -0.0007, 0.0007],
requires_grad=True)
name decoder.layers.18._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0024, 0.0004, 0.0124, ..., 0.0004, 0.0009, 0.0012],
requires_grad=True)
name decoder.layers.19._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0083, 0.0041, 0.0168, ..., 0.0003, 0.0001, 0.0006],
requires_grad=True)
name decoder.layers.20._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0076, 0.0032, -0.0057, ..., 0.0014, -0.0002, 0.0004],
requires_grad=True)
name decoder.layers.21._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0033, 0.0011, 0.0023, ..., -0.0008, 0.0014, 0.0002],
requires_grad=True)
name decoder.layers.22._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-1.5215e-02, 4.2698e-03, -3.4709e-03, ..., -3.9693e-05,
-1.0424e-04, 4.1906e-04], requires_grad=True)
name decoder.layers.23._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0024, -0.0010, 0.0043, ..., -0.0009, -0.0010, 0.0008],
requires_grad=True)
name decoder.layer_norm.weight parameters Parameter containing:
tensor([1., 1., 1., ..., 1., 1., 1.], requires_grad=True)
name decoder.layer_norm.bias parameters Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], requires_grad=True)
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
(repeated several more times; duplicates omitted)
INFO:fairseq.checkpoint_utils:Done loading state dict
INFO:fairseq.models.fairseq_model:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': '/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 2, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False, 'new_profiler': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 64, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://hpc-pg0-132:18422', 'distributed_port': 18422, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None}, 'dataset': {'_name': None, 'num_workers': 8, 'num_workers_valid': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': True, 'validate_interval': 1, 'validate_interval_updates': 1000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 286102, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'clip_norm_type': 'l2', 'skip_gradient_update_on_clip_norm': False, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0002], 
'stop_min_lr': -1.0, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_best_checkpoints': True, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '-model_part-0', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': True, 's3_upload_path': 'https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', 'model_parallel_size': 2}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 64}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='transformer_lm_megatron', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', 
bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=2048, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7fc7442c2a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, 
required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'task': {'_name': 'streaming_language_modeling', 'data': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'end_of_document_symbol': '</s>', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'max_source_positions': None, 'max_target_positions': None, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'data_buffer_size': 10, 'tpu': False, 'update_freq': [1]}, 'criterion': Namespace(_name='vocab_parallel_cross_entropy', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, 
all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function 
load_and_get_model.<locals>.default_post_build_model_hook at 0x7fc7442c2a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.95)', 'adam_eps': 1e-08, 'weight_decay': 0.1, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0002], 'block_wise': False}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 357, 'force_anneal': None, 'end_learning_rate': 2e-05, 'zero_lr_warmup_steps': 0, 'power': 1.0, 'total_num_update': 286102.0, 'lr': [0.0002]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 
'bpe_add_prefix_space': True}, 'tokenizer': None, 'simul_type': None}
Loaded model
model_loading_time=69.8 seconds
model_loading_time_cuda=70.5 seconds
Changing max_positions from 2048 to 1024
task=copa
eval_set=val
eval language=en
train_set=None
train_lang=None
template=copa
calibration_options=[]
nb_few_shot_samples=0
(metaseq_20220328) tbmihaylov@learnfair1855:~/metaseq-internal$ python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks copa --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 8 --max-tokens 1024
model_name=1.3B_gptz_model_parallel
args:Namespace(add_bos_token=False, all_gather_list_size=16384, azureml_logging=False, batch_size=None, batch_size_valid=None, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, combine_valid_subsets=None, context_window=0, cpu=False, cpu_offload=False, criterion='cross_entropy', data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=10938, distributed_rank=0, distributed_world_size=8, dont_log_param_and_grad_norm=False, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, future_target=False, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=False, log_file=None, log_format=None, log_interval=100, log_nvidia_smi=False, lr_scheduler='fixed', max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_valid_steps=None, memory_efficient_fp16=True, min_loss_scale=0.0001, model_overrides='{}', model_parallel_size=1, new_profiler=False, no_progress_bar=False, no_reshard_after_forward=False, no_seed_provided=False, num_shards=1, num_workers=1, num_workers_valid=0, optimizer=None, output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, pad_to_fixed_bsz=False, pad_to_fixed_length=False, past_target=False, path=None, plasma_path='/tmp/plasma', profile=False, required_batch_size_multiple=8, results_path=None, sample_break_mode='none', score_sequences=False, seed=1, self_target=False, shard_id=0, shorten_data_split_list='', shorten_method='none', shuffle_docs=False, skip_invalid_size_inputs_valid_test=False, softmax_batch=9223372036854775807, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, train_subset='train', use_plasma_view=False, use_sharded_state=True, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=4000, zero_sharding='none')
model_config:{'model_path': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt', 'extra_args': ['--use-sharded-state', '--memory-efficient-fp16', '--fp16', '--distributed-port', '10938', '--ddp-backend', 'fully_sharded'], 'model_overrides': {'bpe': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True, 'specify_arch': True, 'batch_size': None, 'batch_size_valid': None}, 'model_parallel_size': 2, 'distributed_world_size': 8}
fairseq_cfg.common.model_parallel_size:2
distributed_training.distributed_port=10938
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
(the warning above is repeated a further 167 times in the original log; duplicates omitted)
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
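For context on this fairscale warning: it fires when the reduce-scatter process group handed to FSDP covers a different set of ranks than the group used for data parallelism. Below is a minimal sketch of constructing both groups over the same ranks so the rollback does not trigger; the wrapper name and rank list are hypothetical, and this is not metaseq's actual wiring:

# Hypothetical sketch, not metaseq's actual code: build the data-parallel
# group and the reduce-scatter group over the *same* ranks so fairscale's
# FSDP does not roll back to the default (world) process group.
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def wrap_with_consistent_groups(module, dp_ranks):  # name is hypothetical
    dp_group = dist.new_group(ranks=dp_ranks)
    # A size mismatch here (e.g. an 8-rank group vs. world size 4) is
    # exactly what produces the "Rolled back ..." warning above.
    rs_group = dist.new_group(ranks=dp_ranks)
    return FSDP(module, process_group=dp_group,
                process_group_reduce_scatter=rs_group)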
name decoder.embed_tokens.weight parameters Parameter containing:
tensor([[ 0.0112, -0.0028, -0.0011, ..., 0.0047, 0.0024, 0.0007],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0024, -0.0045, -0.0020, ..., 0.0029, -0.0063, 0.0034],
...,
[-0.0027, -0.0024, -0.0018, ..., -0.0053, 0.0057, -0.0044],
[-0.0017, 0.0046, 0.0141, ..., -0.0020, -0.0072, 0.0018],
[-0.0006, 0.0109, -0.0022, ..., 0.0001, -0.0074, -0.0069]],
requires_grad=True)
name decoder.embed_positions.weight parameters Parameter containing:
tensor([[-3.0068e-03, 1.0837e-02, 2.3782e-03, ..., -6.6574e-03,
4.9274e-03, -8.3703e-05],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[-1.5410e-03, 7.1037e-04, 2.2012e-03, ..., 5.0920e-03,
-3.7445e-04, -2.3676e-03],
...,
[-2.7224e-03, -8.9764e-03, -3.7800e-03, ..., -5.4837e-03,
-1.3051e-03, -3.1310e-03],
[ 1.3992e-03, -8.6830e-04, -4.5656e-03, ..., -9.9817e-03,
4.4617e-03, -9.1725e-04],
[-1.0534e-02, -9.2207e-03, 4.0750e-03, ..., -5.7695e-03,
5.1774e-03, 5.1820e-03]], requires_grad=True)
name decoder.layers.0._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0014, 0.0041, 0.0014, ..., 0.0014, 0.0017, 0.0009],
requires_grad=True)
name decoder.layers.1._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0039, -0.0025, 0.0056, ..., -0.0007, 0.0010, -0.0005],
requires_grad=True)
name decoder.layers.2._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0010, -0.0103, ..., -0.0021, -0.0008, 0.0004],
requires_grad=True)
name decoder.layers.3._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0056, 0.0057, -0.0034, ..., -0.0008, -0.0006, -0.0009],
requires_grad=True)
name decoder.layers.4._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-1.3845e-03, -3.6725e-03, 2.9034e-03, ..., 3.6426e-05,
1.0134e-04, 7.8038e-04], requires_grad=True)
name decoder.layers.5._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 5.2280e-03, 3.2919e-03, 3.4002e-03, ..., -2.8489e-04,
9.0192e-05, -1.0724e-03], requires_grad=True)
name decoder.layers.6._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-6.9939e-03, 4.5992e-04, 2.4211e-03, ..., 9.1476e-05,
-5.1745e-04, 3.0494e-04], requires_grad=True)
name decoder.layers.7._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0064, 0.0011, -0.0098, ..., -0.0002, 0.0012, -0.0002],
requires_grad=True)
name decoder.layers.8._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0123, 0.0017, -0.0077, ..., -0.0019, 0.0005, -0.0003],
requires_grad=True)
name decoder.layers.9._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0012, -0.0081, 0.0020, ..., -0.0013, -0.0002, -0.0002],
requires_grad=True)
name decoder.layers.10._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0065, -0.0039, 0.0083, ..., 0.0004, 0.0006, -0.0006],
requires_grad=True)
name decoder.layers.11._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 9.7656e-03, -8.7866e-03, -5.2501e-03, ..., -8.4295e-05,
7.3552e-04, -6.2549e-04], requires_grad=True)
name decoder.layers.12._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0001, 0.0019, 0.0007, ..., -0.0010, 0.0007, -0.0010],
requires_grad=True)
name decoder.layers.13._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0069, 0.0098, -0.0094, ..., -0.0001, 0.0008, -0.0004],
requires_grad=True)
name decoder.layers.14._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0009, -0.0101, -0.0072, ..., 0.0016, 0.0003, 0.0006],
requires_grad=True)
name decoder.layers.15._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0079, 0.0049, -0.0042, ..., -0.0003, 0.0012, -0.0010],
requires_grad=True)
name decoder.layers.16._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-6.3444e-04, -5.2121e-03, -6.0865e-03, ..., -9.2282e-04,
1.5995e-03, -2.4477e-05], requires_grad=True)
name decoder.layers.17._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0008, -0.0024, -0.0002, ..., 0.0008, 0.0013, -0.0007],
requires_grad=True)
name decoder.layers.18._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0164, 0.0008, -0.0033, ..., 0.0006, 0.0003, 0.0002],
requires_grad=True)
name decoder.layers.19._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0048, 0.0054, -0.0005, ..., -0.0012, 0.0014, -0.0005],
requires_grad=True)
name decoder.layers.20._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0024, 0.0020, -0.0011, ..., -0.0013, 0.0005, 0.0009],
requires_grad=True)
name decoder.layers.21._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0008, 0.0004, 0.0055, ..., -0.0023, -0.0004, 0.0002],
requires_grad=True)
name decoder.layers.22._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0003, 0.0052, 0.0031, ..., 0.0005, 0.0002, -0.0008],
requires_grad=True)
name decoder.layers.23._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0040, -0.0059, 0.0030, ..., 0.0006, -0.0004, -0.0006],
requires_grad=True)
name decoder.layer_norm.weight parameters Parameter containing:
tensor([1., 1., 1., ..., 1., 1., 1.], requires_grad=True)
name decoder.layer_norm.bias parameters Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], requires_grad=True)
INFO:fairseq.checkpoint_utils:Done loading state dict
INFO:fairseq.models.fairseq_model:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': '/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 2, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False, 'new_profiler': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 64, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://hpc-pg0-132:18422', 'distributed_port': 18422, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None}, 'dataset': {'_name': None, 'num_workers': 8, 'num_workers_valid': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': True, 'validate_interval': 1, 'validate_interval_updates': 1000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 286102, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'clip_norm_type': 'l2', 'skip_gradient_update_on_clip_norm': False, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0002], 
'stop_min_lr': -1.0, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_best_checkpoints': True, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '-model_part-0', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': True, 's3_upload_path': 'https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', 'model_parallel_size': 2}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 64}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='transformer_lm_megatron', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', 
bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=2048, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7f6e0ebf0a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, 
required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'task': {'_name': 'streaming_language_modeling', 'data': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'end_of_document_symbol': '</s>', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'max_source_positions': None, 'max_target_positions': None, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'data_buffer_size': 10, 'tpu': False, 'update_freq': [1]}, 'criterion': Namespace(_name='vocab_parallel_cross_entropy', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, 
all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function 
load_and_get_model.<locals>.default_post_build_model_hook at 0x7f6e0ebf0a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.95)', 'adam_eps': 1e-08, 'weight_decay': 0.1, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0002], 'block_wise': False}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 357, 'force_anneal': None, 'end_learning_rate': 2e-05, 'zero_lr_warmup_steps': 0, 'power': 1.0, 'total_num_update': 286102.0, 'lr': [0.0002]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 
'bpe_add_prefix_space': True}, 'tokenizer': None, 'simul_type': None}
Loaded model
model_loading_time=71.6 seconds
model_loading_time_cuda=72.3 seconds
Changing max_positions from 2048 to 1024
task=copa
eval_set=val
eval language=en
train_set=None
train_lang=None
template=copa
calibration_options=[]
nb_few_shot_samples=0
expected_max_tgt_len=24, max_positions=1024
Average number of train samples: 0.00
Predicting 100 samples with 200 prompts..
Before running model, bs=1, max_tgt_len=23 mem=0.31GB
results={'model_name': '1.3B_gptz_model_parallel', 'task': 'copa', 'language': 'en', 'template': 'copa', 'nb_few_shot_samples': 0, 'calibration_options': [], 'calibrator_name': None, 'train_set': None, 'valid_set': None, 'eval_set': 'val', 'train_lang': None, 'valid_lang': None, 'ppl_common_prefix': {'scores': [932.2959832000732], 'mean': 932.2959832000732, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_selected_candidate': {'scores': [16.712474422454832], 'mean': 16.712474422454832, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_full_selected_candidate': {'scores': [84.09945600509644], 'mean': 84.09945600509644, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_candidates_full_prompt__choice1': {'scores': [105.29235980987549], 'mean': 105.29235980987549, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_candidates_full_prompt__choice2': {'scores': [104.99039335250855], 'mean': 104.99039335250855, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_candidates_full_prompt': {'scores': [105.14137658119202], 'mean': 105.14137658119202, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_candidates': {'scores': [28.473971219062804], 'mean': 28.473971219062804, 'std': 0.0, 'mean_confidence_interval': nan}, 'nb_trunc_few_shot_samples': {'scores': [0.0], 'mean': 0.0, 'std': 0.0, 'mean_confidence_interval': nan}, 'accuracy': {'scores': [70.0], 'mean': 70.0, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_answer_correct_gold': {'scores': [20.473032026290895], 'mean': 20.473032026290895, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_answer_incorrect_gold': {'scores': [36.47491041183471], 'mean': 36.47491041183471, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_answer_incorrect_std_gold': {'scores': [0.0], 'mean': 0.0, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_answer_incorrect_min_gold': {'scores': [36.47491041183471], 'mean': 36.47491041183471, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_answer_correct_lt_incorrect_gold': {'scores': [70.0], 'mean': 70.0, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_full_correct_gold': {'scores': [101.22055389404296], 'mean': 101.22055389404296, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_full_incorrect_gold': {'scores': [109.06219926834106], 'mean': 109.06219926834106, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_full_incorrect_std_gold': {'scores': [0.0], 'mean': 0.0, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_full_incorrect_min_gold': {'scores': [109.06219926834106], 'mean': 109.06219926834106, 'std': 0.0, 'mean_confidence_interval': nan}, 'ppl_full_correct_lt_incorrect_gold': {'scores': [67.0], 'mean': 67.0, 'std': 0.0, 'mean_confidence_interval': nan}, 'execution_time': {'scores': [4.199410188943148], 'mean': 4.199410188943148, 'std': 0.0, 'mean_confidence_interval': nan}}
ppl_selected_candidate = 16.7125
ppl_full_selected_candidate = 84.0995
ppl_full_incorrect_std_gold = 0.0
ppl_full_incorrect_min_gold = 109.0622
ppl_full_incorrect_gold = 109.0622
ppl_full_correct_lt_incorrect_gold = 67.0
ppl_full_correct_gold = 101.2206
ppl_common_prefix = 932.296
ppl_candidates_full_prompt__choice2 = 104.9904
ppl_candidates_full_prompt__choice1 = 105.2924
ppl_candidates_full_prompt = 105.1414
ppl_candidates = 28.474
ppl_answer_incorrect_std_gold = 0.0
ppl_answer_incorrect_min_gold = 36.4749
ppl_answer_incorrect_gold = 36.4749
ppl_answer_correct_lt_incorrect_gold = 70.0
ppl_answer_correct_gold = 20.473
nb_trunc_few_shot_samples = 0.0
execution_time = 4.1994
accuracy = 70.0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:14 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for 8 nodes.
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 5 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 7 using best-guess GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
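These barrier warnings are benign here, but they point at the standard fix: pin each rank to its GPU and pass device_ids to barrier(), as the message suggests. A short sketch, assuming the usual one-process-per-GPU layout (the local-rank computation is an assumption, not taken from metaseq):

# Sketch: set the CUDA device for each rank and tell barrier() which
# device to use, so NCCL does not have to guess the rank-to-GPU mapping.
import torch
import torch.distributed as dist

local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
dist.barrier(device_ids=[local_rank])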
Glad you are unblocked :) but we should still not fail with TP=2 and world size = 2. Can you update the issue title to reflect this? I (or someone else) can follow up.
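For anyone landing here from the traceback: the RuntimeError in the title comes from an index_select (the embedding lookup) whose index tensor is still on CPU while the model-parallel weight shard lives on cuda:1. A minimal self-contained illustration of that error class and the generic fix; this is not the metaseq call site:

# Illustration only, not the metaseq call site: nn.Embedding dispatches
# to index_select, which raises this exact RuntimeError when the indices
# are on CPU and the weights are on cuda:1.
import torch

emb = torch.nn.Embedding(10, 4).to("cuda:1")
tokens = torch.tensor([1, 2, 3])            # indices left on CPU
# emb(tokens)  # RuntimeError: Expected all tensors to be on the same device ...
out = emb(tokens.to(emb.weight.device))     # fix: move indices to the weight's device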