gpt-neo
Not able to generate predicted text after `Done with copy master to slices.` with 1.3B pre-trained model
Describe the bug
On running the main.py script using the pre-trained 1.3B model with the --predict flag on, the runtime is stuck for hours after printing `Done with copy master to slices.`, and the predictions are not generated.
To Reproduce
Steps to reproduce the behavior:
- Download the pre-trained 1.3B model from https://mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/ using `wget`
- Create a file with the prompt text, `sample_prompt.txt`
- Edit the config file at `./GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/config.json`: set `"mesh_shape" : "x:1,y:1"` (according to the GPU devices) and set `model_path` to `GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/`
- From root directory of the repository, run
python3 main.py --predict --prompt sample_prompt.txt --gpu_ids 'device:GPU:0' --model "/home/sanchi/GPTNeo/GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/config.json"
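The config edit in the steps above could be scripted roughly as follows. This is only a minimal sketch: the helper names `mesh_size`, `update_config` and `mesh_matches_gpus` are hypothetical and not part of the repository, and the check assumes that the product of the `mesh_shape` dimensions should equal the number of entries passed via `--gpu_ids`.

```python
import json
import math
import os
import tempfile


def mesh_size(mesh_shape: str) -> int:
    """Number of devices implied by a mesh string such as "x:1,y:2"."""
    dims = [int(part.split(":")[1]) for part in mesh_shape.split(",")]
    return math.prod(dims)


def mesh_matches_gpus(mesh_shape: str, gpu_ids: list) -> bool:
    """Assumption: the mesh should cover exactly the listed GPU ids."""
    return mesh_size(mesh_shape) == len(gpu_ids)


def update_config(config_path: str, mesh_shape: str, model_path: str) -> dict:
    """Set mesh_shape and model_path in an existing config.json, keeping other keys."""
    with open(config_path) as f:
        config = json.load(f)
    config["mesh_shape"] = mesh_shape
    config["model_path"] = model_path
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config


if __name__ == "__main__":
    # Demo on a throwaway file so the real config is not touched.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w") as f:
            json.dump({"mesh_shape": "x:1,y:2", "n_head": 16}, f)
        cfg = update_config(
            path,
            mesh_shape="x:1,y:1",
            model_path="GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/",
        )
        print(cfg["mesh_shape"], mesh_matches_gpus(cfg["mesh_shape"], ["device:GPU:0"]))
```

Running the same check against the values printed in the runtime log (`'mesh_shape': 'x:1,y:2'` with two `gpu_ids`) may help confirm whether the edited config is the one actually being picked up.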
Expected behavior
Generate predicted text
Runtime Logs
Current step 362000
Saving config to /home/sanchi/GPTNeo/GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7f1f57167b80>, {'n_head': 16, 'n_vocab': 50257, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0, 'train_batch_size': 512, 'attn_dropout': 0, 'train_steps': 400000, 'lr_decay_end': 300000, 'eval_steps': 10, 'predict_steps': 0, 'res_dropout': 0, 'eval_batch_size': 128, 'predict_batch_size': 128, 'iterations': 500, 'n_embd': 2048, 'datasets': [['pile', None, None, None]], 'model_path': '/home/sanchi/GPTNeo/GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local'], 'mesh_shape': 'x:1,y:2', 'layout': 'batch:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 4096, 'precision': 'bfloat16', 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'pile': {'n_vocab': 50257, 'path': 'gs://neo-datasets/pile/pile_*.tfrecords', 'eval_path': 'gs://neo-datasets/pile_val.tfrecords', 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 2, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0', 'device:GPU:1'], 'steps_per_checkpoint': 5000, 'predict': True, 'model': 'GPT', 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/home/sanchi/GPTNeo/GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=500, num_shards=2, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
_TPUContext: eval_on_tpu True
eval_on_tpu ignored because use_tpu is False.
Predictions generated
Calling model_fn.
Running infer on CPU/GPU
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
Variable gpt2/h0/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h0/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h0/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h0/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h0/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h0/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h1/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h1/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h1/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h1/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h1/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h1/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h10/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h10/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h10/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h10/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h10/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h10/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h11/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h11/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h11/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h11/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h11/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h11/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h12/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h12/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h12/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h12/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h12/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h12/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h13/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h13/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h13/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h13/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h13/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h13/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h14/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h14/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h14/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h14/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h14/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h14/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h15/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h15/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h15/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h15/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h15/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h15/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h16/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h16/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h16/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h16/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h16/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h16/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h17/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h17/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h17/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h17/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h17/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h17/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h18/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h18/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h18/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h18/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h18/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h18/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h19/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h19/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h19/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h19/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h19/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h19/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h2/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h2/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h2/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h2/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h2/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h2/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h20/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h20/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h20/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h20/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h20/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h20/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h21/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h21/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h21/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h21/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h21/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h21/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h22/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h22/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h22/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h22/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h22/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h22/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h23/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h23/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h23/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h23/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h23/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h23/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h3/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h3/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h3/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h3/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h3/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h3/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h4/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h4/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h4/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h4/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h4/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h4/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h5/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h5/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h5/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h5/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h5/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h5/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h6/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h6/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h6/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h6/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h6/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h6/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h7/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h7/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h7/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h7/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h7/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h7/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h8/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h8/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h8/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h8/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h8/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h8/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/h9/attn/k size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h9/attn/o size 4194304 slice_size 2097152 Shape[heads=2048, embd=2048]
Variable gpt2/h9/attn/q size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h9/attn/v size 4194304 slice_size 2097152 Shape[embd=2048, heads=2048]
Variable gpt2/h9/mlp/conv1d_main/c_fc/kernel size 16777216 slice_size 8388608 Shape[embd=2048, intermediate_expanded=8192]
Variable gpt2/h9/mlp/conv1d_main/c_proj/kernel size 16777216 slice_size 8388608 Shape[intermediate_expanded=8192, embd=2048]
Variable gpt2/wpe size 4194304 slice_size 2097152 Shape[embed_sequence=2048, embd=2048]
Variable gpt2/wte size 102926336 slice_size 51463168 Shape[vocab=50257, embd=2048]
Variable stacked/gpt2/h0/mlp/conv1d_main/c_fc/bias size 65536 slice_size 65536 Shape[stacked=8, intermediate_expanded=8192]
gpt2/h0/mlp/conv1d_main/c_fc/bias
gpt2/h1/mlp/conv1d_main/c_fc/bias
gpt2/h2/mlp/conv1d_main/c_fc/bias
gpt2/h3/mlp/conv1d_main/c_fc/bias
gpt2/h4/mlp/conv1d_main/c_fc/bias
gpt2/h5/mlp/conv1d_main/c_fc/bias
gpt2/h6/mlp/conv1d_main/c_fc/bias
gpt2/h7/mlp/conv1d_main/c_fc/bias
Variable stacked/gpt2/h0/norm_1/g size 131072 slice_size 65536 Shape[stacked=64, embd=2048]
gpt2/h0/norm_1/g
gpt2/h0/norm_1/b
gpt2/h0/attn/compute_output_bias/o_b
gpt2/h0/norm_2/g
gpt2/h0/norm_2/b
gpt2/h0/mlp/conv1d_main/c_proj/bias
gpt2/h1/norm_1/g
gpt2/h1/norm_1/b
gpt2/h1/attn/compute_output_bias/o_b
gpt2/h1/norm_2/g
gpt2/h1/norm_2/b
gpt2/h1/mlp/conv1d_main/c_proj/bias
gpt2/h2/norm_1/g
gpt2/h2/norm_1/b
gpt2/h2/attn/compute_output_bias/o_b
gpt2/h2/norm_2/g
gpt2/h2/norm_2/b
gpt2/h2/mlp/conv1d_main/c_proj/bias
gpt2/h3/norm_1/g
gpt2/h3/norm_1/b
gpt2/h3/attn/compute_output_bias/o_b
gpt2/h3/norm_2/g
gpt2/h3/norm_2/b
gpt2/h3/mlp/conv1d_main/c_proj/bias
gpt2/h4/norm_1/g
gpt2/h4/norm_1/b
gpt2/h4/attn/compute_output_bias/o_b
gpt2/h4/norm_2/g
gpt2/h4/norm_2/b
gpt2/h4/mlp/conv1d_main/c_proj/bias
gpt2/h5/norm_1/g
gpt2/h5/norm_1/b
gpt2/h5/attn/compute_output_bias/o_b
gpt2/h5/norm_2/g
gpt2/h5/norm_2/b
gpt2/h5/mlp/conv1d_main/c_proj/bias
gpt2/h6/norm_1/g
gpt2/h6/norm_1/b
gpt2/h6/attn/compute_output_bias/o_b
gpt2/h6/norm_2/g
gpt2/h6/norm_2/b
gpt2/h6/mlp/conv1d_main/c_proj/bias
gpt2/h7/norm_1/g
gpt2/h7/norm_1/b
gpt2/h7/attn/compute_output_bias/o_b
gpt2/h7/norm_2/g
gpt2/h7/norm_2/b
gpt2/h7/mlp/conv1d_main/c_proj/bias
gpt2/h8/norm_1/g
gpt2/h8/norm_1/b
gpt2/h8/attn/compute_output_bias/o_b
gpt2/h8/norm_2/g
gpt2/h8/norm_2/b
gpt2/h8/mlp/conv1d_main/c_proj/bias
gpt2/h9/norm_1/g
gpt2/h9/norm_1/b
gpt2/h9/attn/compute_output_bias/o_b
gpt2/h9/norm_2/g
gpt2/h9/norm_2/b
gpt2/h9/mlp/conv1d_main/c_proj/bias
gpt2/h10/norm_1/g
gpt2/h10/norm_1/b
gpt2/h10/attn/compute_output_bias/o_b
gpt2/h10/norm_2/g
Variable stacked/gpt2/h10/norm_2/b size 131072 slice_size 65536 Shape[stacked=64, embd=2048]
gpt2/h10/norm_2/b
gpt2/h10/mlp/conv1d_main/c_proj/bias
gpt2/h11/norm_1/g
gpt2/h11/norm_1/b
gpt2/h11/attn/compute_output_bias/o_b
gpt2/h11/norm_2/g
gpt2/h11/norm_2/b
gpt2/h11/mlp/conv1d_main/c_proj/bias
gpt2/h12/norm_1/g
gpt2/h12/norm_1/b
gpt2/h12/attn/compute_output_bias/o_b
gpt2/h12/norm_2/g
gpt2/h12/norm_2/b
gpt2/h12/mlp/conv1d_main/c_proj/bias
gpt2/h13/norm_1/g
gpt2/h13/norm_1/b
gpt2/h13/attn/compute_output_bias/o_b
gpt2/h13/norm_2/g
gpt2/h13/norm_2/b
gpt2/h13/mlp/conv1d_main/c_proj/bias
gpt2/h14/norm_1/g
gpt2/h14/norm_1/b
gpt2/h14/attn/compute_output_bias/o_b
gpt2/h14/norm_2/g
gpt2/h14/norm_2/b
gpt2/h14/mlp/conv1d_main/c_proj/bias
gpt2/h15/norm_1/g
gpt2/h15/norm_1/b
gpt2/h15/attn/compute_output_bias/o_b
gpt2/h15/norm_2/g
gpt2/h15/norm_2/b
gpt2/h15/mlp/conv1d_main/c_proj/bias
gpt2/h16/norm_1/g
gpt2/h16/norm_1/b
gpt2/h16/attn/compute_output_bias/o_b
gpt2/h16/norm_2/g
gpt2/h16/norm_2/b
gpt2/h16/mlp/conv1d_main/c_proj/bias
gpt2/h17/norm_1/g
gpt2/h17/norm_1/b
gpt2/h17/attn/compute_output_bias/o_b
gpt2/h17/norm_2/g
gpt2/h17/norm_2/b
gpt2/h17/mlp/conv1d_main/c_proj/bias
gpt2/h18/norm_1/g
gpt2/h18/norm_1/b
gpt2/h18/attn/compute_output_bias/o_b
gpt2/h18/norm_2/g
gpt2/h18/norm_2/b
gpt2/h18/mlp/conv1d_main/c_proj/bias
gpt2/h19/norm_1/g
gpt2/h19/norm_1/b
gpt2/h19/attn/compute_output_bias/o_b
gpt2/h19/norm_2/g
gpt2/h19/norm_2/b
gpt2/h19/mlp/conv1d_main/c_proj/bias
gpt2/h20/norm_1/g
gpt2/h20/norm_1/b
gpt2/h20/attn/compute_output_bias/o_b
gpt2/h20/norm_2/g
gpt2/h20/norm_2/b
gpt2/h20/mlp/conv1d_main/c_proj/bias
gpt2/h21/norm_1/g
gpt2/h21/norm_1/b
Variable stacked/gpt2/h16/mlp/conv1d_main/c_fc/bias size 65536 slice_size 65536 Shape[stacked=8, intermediate_expanded=8192]
gpt2/h16/mlp/conv1d_main/c_fc/bias
gpt2/h17/mlp/conv1d_main/c_fc/bias
gpt2/h18/mlp/conv1d_main/c_fc/bias
gpt2/h19/mlp/conv1d_main/c_fc/bias
gpt2/h20/mlp/conv1d_main/c_fc/bias
gpt2/h21/mlp/conv1d_main/c_fc/bias
gpt2/h22/mlp/conv1d_main/c_fc/bias
gpt2/h23/mlp/conv1d_main/c_fc/bias
Variable stacked/gpt2/h21/attn/compute_output_bias/o_b size 36864 slice_size 18432 Shape[stacked=18, embd=2048]
gpt2/h21/attn/compute_output_bias/o_b
gpt2/h21/norm_2/g
gpt2/h21/norm_2/b
gpt2/h21/mlp/conv1d_main/c_proj/bias
gpt2/h22/norm_1/g
gpt2/h22/norm_1/b
gpt2/h22/attn/compute_output_bias/o_b
gpt2/h22/norm_2/g
gpt2/h22/norm_2/b
gpt2/h22/mlp/conv1d_main/c_proj/bias
gpt2/h23/norm_1/g
gpt2/h23/norm_1/b
gpt2/h23/attn/compute_output_bias/o_b
gpt2/h23/norm_2/g
gpt2/h23/norm_2/b
gpt2/h23/mlp/conv1d_main/c_proj/bias
gpt2/ln_f/g
gpt2/ln_f/b
Variable stacked/gpt2/h8/mlp/conv1d_main/c_fc/bias size 65536 slice_size 65536 Shape[stacked=8, intermediate_expanded=8192]
gpt2/h8/mlp/conv1d_main/c_fc/bias
gpt2/h9/mlp/conv1d_main/c_fc/bias
gpt2/h10/mlp/conv1d_main/c_fc/bias
gpt2/h11/mlp/conv1d_main/c_fc/bias
gpt2/h12/mlp/conv1d_main/c_fc/bias
gpt2/h13/mlp/conv1d_main/c_fc/bias
gpt2/h14/mlp/conv1d_main/c_fc/bias
gpt2/h15/mlp/conv1d_main/c_fc/bias
Trainable Variables count: 152 Total size: 1315575808 Total slice_size: 657886208
All Variables count: 152 Total size: 1315575808 Total slice_size: 657886208
Counters:
allconcat: 1.05e+06
allconcat/0: 1.05e+06
allconcat/0/reshape_op: 1.05e+06
allreduce: 2.19e+11
allreduce/[0]: 2
allreduce/[0]/reduce_op: 2
allreduce/[1]: 2.19e+11
allreduce/[1]/einsum_op: 2.19e+11
allreduce/[1]/reduce_op: 2.53e+08
einsum: 4.24e+14
einsum_unique: 4.11e+14
output: 3.36e+12
output/AddOperation: 7.75e+11
output/BinaryOpWithBroadcasting: 1.32e+08
output/BroadcastOperation: 1.03e+11
output/ConcatOperation: 5.15e+10
output/Constant: 4.92e+04
output/EinsumOperation: 8.01e+11
output/ImportOperation: 5.24e+05
output/OneHotOperation: 2.64e+10
output/RangeOperation: 6.35e+04
output/ReduceOperation: 4.54e+08
output/ReshapeOperation: 1.93e+11
output/ScalarAddOperation: 1.03e+11
output/ScalarMultiplyOperation: 3.22e+11
output/ShiftOperation: 2.58e+10
output/SlicewiseOperation: 7.55e+11
output/StackedVariable: 6.92e+05
output/StopGradient: 1.55e+11
output/UnstackOperation: 6.92e+05
output/Variable: 1.32e+09
output/WhileLoopOperation: 5.15e+10
output_unique: 2.32e+12
output_unique/AddOperation: 5.94e+11
output_unique/BinaryOpWithBroadcasting: 6.79e+07
output_unique/BroadcastOperation: 1.03e+11
output_unique/ConcatOperation: 2.58e+10
output_unique/Constant: 2.46e+04
output_unique/EinsumOperation: 4.92e+11
output_unique/ImportOperation: 2.62e+05
output_unique/OneHotOperation: 1.32e+10
output_unique/RangeOperation: 3.28e+04
output_unique/ReduceOperation: 2.27e+08
output_unique/ReshapeOperation: 1.03e+11
output_unique/ScalarAddOperation: 5.16e+10
output_unique/ScalarMultiplyOperation: 1.68e+11
output_unique/ShiftOperation: 1.29e+10
output_unique/SlicewiseOperation: 6e+11
output_unique/StackedVariable: 4.96e+05
output_unique/StopGradient: 1.29e+11
output_unique/UnstackOperation: 4.96e+05
output_unique/Variable: 1.32e+09
output_unique/WhileLoopOperation: 2.58e+10
variables: 1.32e+09
variables/trainable: 1.32e+09
Done calling model_fn.
Graph was finalized.
Restoring parameters from /home/sanchi/GPTNeo/GPT_1_3B/mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/model.ckpt-362000
Running local_init_op.
Done running local_init_op.
Before copy master to slices.
Done with copy master to slices.
Environment:
- GPUs: I am using a DGX Machine with 4 GPUs of 32 GB RAM each.
- Configs: Ubuntu 18.04.5, conda environment with Python 3.9.7