CPM-1-Finetune-Text-Generation
Unable to load the CPM-distill model when fine-tuning it with this project
Since I don't have enough GPU memory, I planned to fine-tune CPM-distill (the officially released distilled version of the CPM-LM (2.6B) model). When using this project, however, the model fails to load with a size-mismatch error. Could you help me figure out the cause?
(liubiao2) kingsoft@k8s-w-10-13-84-7:~/liubiao2/smartWriter/CPM-1-Finetune$ bash scripts/novel/finetune_novel_fp32.sh
using world size: 1 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
[2021-12-07 10:08:46,970] [INFO] [checkpointing.py:795:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-12-07 10:08:46,970] [INFO] [checkpointing.py:234:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 26051 and data parallel seed: 23333
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33461/33461 [00:01<00:00, 19888.74it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10526/10526 [00:00<00:00, 31213.74it/s]
building GPT2 model ...
> number of parameters on model parallel rank 0: 1023544320
50 98
Optimizer = FusedAdam
learning rate decaying linear
DeepSpeed is enabled.
[2021-12-07 10:09:04,694] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.5.8, git-hash=unknown, git-branch=unknown
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] initializing deepspeed groups using mpu
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] Initializing deepspeed groups with model parallel size 1, expert parallel size 1, and data parallel size 1
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0]
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] creating expert parallel process group with ranks: [0]
[2021-12-07 10:09:04,707] [INFO] [engine.py:279:__init__] DeepSpeed Flops Profiler Enabled: False
[2021-12-07 10:09:04,707] [INFO] [engine.py:1095:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-07 10:09:04,707] [INFO] [engine.py:1100:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-07 10:09:04,711] [INFO] [engine.py:1117:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[2021-12-07 10:09:04,712] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2021-12-07 10:09:04,712] [INFO] [engine.py:808:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-12-07 10:09:04,712] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <learning_rates.AnnealingLR object at 0x7fa961ee9278>
[2021-12-07 10:09:04,712] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-12-07 10:09:04,712] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] activation_checkpointing_config {
"partition_activations": true,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] amp_enabled .................. False
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] amp_params ................... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": null,
"exps_dir": null,
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] bfloat16_enabled ............. False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] checkpoint_tag_validation_enabled True
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] checkpoint_tag_validation_fail False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] communication_data_type ...... None
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] curriculum_enabled ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] curriculum_params ............ False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] dataloader_drop_last ......... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] disable_allgather ............ False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] dump_state ................... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] dynamic_loss_scale_args ...... None
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_enabled ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_gas_boundary_resolution 1
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_layer_name ........ bert.encoder.layer
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_layer_num ......... 0
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_max_iter .......... 100
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_stability ......... 1e-06
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_tol ............... 0.01
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_verbose ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] elasticity_enabled ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] fp16_enabled ................. False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] fp16_master_weights_and_gradients False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] fp16_mixed_quantize .......... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] global_rank .................. 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] gradient_accumulation_steps .. 4
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] gradient_clipping ............ 1.0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] gradient_predivide_factor .... 1.0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] initial_dynamic_scale ........ 4294967296
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] loss_scale ................... 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] memory_breakdown ............. False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] optimizer_legacy_fusion ...... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] optimizer_name ............... None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] optimizer_params ............. None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] pld_enabled .................. False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] pld_params ................... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] prescale_gradients ........... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_change_rate ......... 0.001
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_groups .............. 1
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_offset .............. 1000
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_period .............. 1000
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_rounding ............ 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_start_bits .......... 16
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_target_bits ......... 8
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_training_enabled .... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_type ................ 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_verbose ............. False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] scheduler_name ............... None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] scheduler_params ............. None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] sparse_attention ............. None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] sparse_gradients_enabled ..... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] steps_per_print .............. 100000000
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] tensorboard_enabled .......... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] tensorboard_output_path ......
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] train_batch_size ............. 4
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] train_micro_batch_size_per_gpu 1
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] use_quantizer_kernel ......... False
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] wall_clock_breakdown ......... False
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] world_size ................... 1
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] zero_allow_untested_optimizer False
[2021-12-07 10:09:04,734] [INFO] [config.py:1063:print] zero_config .................. {
"stage": 0,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": false,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"round_robin_gradients": false,
"legacy_stage1": false
}
[2021-12-07 10:09:04,734] [INFO] [config.py:1063:print] zero_enabled ................. False
[2021-12-07 10:09:04,734] [INFO] [config.py:1063:print] zero_optimization_stage ...... 0
[2021-12-07 10:09:04,734] [INFO] [config.py:1071:print] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 4,
"steps_per_print": 1.000000e+08,
"gradient_clipping": 1.0,
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
Using /home/kingsoft/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/kingsoft/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3976883888244629 seconds
[2021-12-07 10:09:06,672] [INFO] [state_dict_factory.py:109:get_merge_state_dicts] mp_rank: 0, ckpt_list: ['/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill/310000/mp_rank_00_model_states.pt', '/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill/310000/mp_rank_01_model_states.pt']
[2021-12-07 10:09:06,780] [INFO] [state_dict_factory.py:322:merge_state_dict] checkpoint version: 0
Traceback (most recent call last):
File "finetune_text_generation.py", line 324, in <module>
main()
File "finetune_text_generation.py", line 208, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 510, in setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 281, in load_checkpoint
checkpoint_name, sd = model.load_checkpoint(args.load, iteration, load_module_strict=False, load_optimizer_states=False, load_lr_scheduler_states=False)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2414, in load_checkpoint
load_module_only=load_module_only)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2459, in _load_checkpoint
strict=load_module_strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2302, in load_module_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/distributed.py", line 90, in load_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for GPT2Model:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([30000, 768]) from checkpoint, the shape in current model is torch.Size([30000, 2560]).
size mismatch for position_embeddings.weight: copying a param with shape torch.Size([1024, 768]) from checkpoint, the shape in current model is torch.Size([1024, 2560]).
size mismatch for transformer.layers.0.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.0.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.1.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.1.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.1.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.2.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.2.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.2.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.3.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.3.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.3.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.4.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.4.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.4.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.5.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.5.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.5.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.6.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.6.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.6.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.6.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.6.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.6.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.7.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.7.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.7.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.7.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.7.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.7.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.8.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.8.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.8.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.8.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.8.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.8.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.9.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.9.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.9.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.9.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.9.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.9.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.10.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.10.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.10.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.10.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.10.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.10.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.11.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.11.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.11.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.11.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.11.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.11.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
Traceback (most recent call last):
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/kingsoft/anaconda3/envs/liubiao2/bin/python3', '-u', 'finetune_text_generation.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed_id/', '--model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '2560', '--load', '/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/scripts/novel/../ds_config/ds_finetune_large_fp32.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations', '--deepspeed-activation-checkpointing']' returned non-zero exit status 1.
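(For reference: every mismatch above says the checkpoint tensors have hidden size 768 while the freshly built model expects 2560, i.e. the script is constructing a 2.6B-style model around a distilled 768-hidden checkpoint. A minimal sketch to confirm what is actually stored in the two shards, assuming the usual DeepSpeed layout where the model weights sit under the 'module' key; adjust the key if your files differ:)

import torch

ckpt_dir = "/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill/310000"
for shard in ("mp_rank_00_model_states.pt", "mp_rank_01_model_states.pt"):
    sd = torch.load(f"{ckpt_dir}/{shard}", map_location="cpu")
    module = sd.get("module", sd)  # fall back to the raw dict if there is no 'module' key
    print(shard)
    for name in ("word_embeddings.weight", "position_embeddings.weight"):
        if name in module:
            print(" ", name, tuple(module[name].shape))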
You can refer to https://github.com/TsinghuaAI/CPM-1-Distill to adjust the configuration.
Hi, the error still occurs. Could you point out where the configuration is wrong? Thanks.
finetune_novel_fp32.sh configuration:
#!/bin/bash
DATA_DIR="./data/novel/preprocessed_id/"
CHECKPOINT_PATH="/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill"
RESULTS_DIR="results/"
MODEL_NAME="finetune-novel"
TOKENIZER_PATH="bpe_3w_new/"
MPSIZE=1
NLAYERS=12
NHIDDEN=768
NATT=12
MAXSEQLEN=1024
CUR_PATH=$(realpath $0)
CUR_DIR=$(dirname ${CUR_PATH})
DS_CONFIG="${CUR_DIR}/../ds_config/ds_finetune_large_fp32.json"
python3 -m torch.distributed.launch --master_port ${1-1122} --nproc_per_node 1 finetune_text_generation.py \
--do_train \
--do_eval \
--data_dir ${DATA_DIR} \
--model-parallel-size ${MPSIZE} \
--num-layers ${NLAYERS} \
--hidden-size ${NHIDDEN} \
--load ${CHECKPOINT_PATH} \
--num-attention-heads ${NATT} \
--seq-length ${MAXSEQLEN} \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--tokenizer-path ${TOKENIZER_PATH} \
--vocab-size 30000 \
--lr 0.00001 \
--warmup 0.1 \
--batch-size 1 \
--deepspeed \
--deepspeed_config ${DS_CONFIG} \
--log-interval 10 \
--eval-interval 50 \
--seed 23333 \
--results_dir ${RESULTS_DIR} \
--model_name ${MODEL_NAME} \
--epoch 10 \
--checkpoint-activations \
--deepspeed-activation-checkpointing
Error message:
Traceback (most recent call last):
File "finetune_text_generation.py", line 324, in <module>
main()
File "finetune_text_generation.py", line 238, in main
output = model(**batch)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1606, in forward
loss = self.module(*inputs, **kwargs)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/distributed.py", line 78, in forward
return self.module(*inputs, **kwargs)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/gpt2_modeling.py", line 97, in forward
transformer_output = self.transformer(embeddings, attention_mask)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 412, in forward
hidden_states, attention_mask)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 743, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in forward
outputs = run_function(*inputs_cuda)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 402, in custom_forward
x_ = layer(x_, inputs[1])
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 288, in forward
attention_output = self.attention(layernorm_output, ltor_mask)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 132, in forward
attention_scores = torch.mul(attention_scores, ltor_mask) - \
RuntimeError: The size of tensor a (1024) must match the size of tensor b (1048576) at non-singleton dimension 3
Traceback (most recent call last):
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/kingsoft/anaconda3/envs/liubiao2/bin/python3', '-u', 'finetune_text_generation.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed_id/', '--model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '768', '--load', '/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill', '--num-attention-heads', '12', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/scripts/novel/../ds_config/ds_finetune_large_fp32.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations', '--deepspeed-activation-checkpointing']' returned non-zero exit status 1.
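(Side note: 1048576 = 1024 × 1024, so the ltor_mask reaching the attention op appears to carry seq_len² entries in its last dimension rather than being shaped (1, 1, seq_len, seq_len). The sketch below only illustrates the conventional left-to-right mask shape for comparison; it is not the repository's exact code:)

import torch

seq_len = 1024
# Lower-triangular causal mask, broadcastable against (batch, heads, seq, seq) attention scores.
ltor_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
print(ltor_mask.shape)  # torch.Size([1, 1, 1024, 1024])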
Is the DeepSpeed configuration correct?
The DeepSpeed version follows the original project's setup (deepspeed==0.3.15), but I still haven't found the cause.
I'm not sure whether it is related to running on a single GPU; inexplicable errors keep showing up.
ds_finetune_large_fp32.json
The file was configured based on https://github.com/TsinghuaAI/CPM-1-Distill/blob/main/configs/deepspeed/ds_zero2_config_small.json
Thanks, I'll give it a try.
Hi, could you try running it on your side? I keep hitting errors here, and in principle this shouldn't have much to do with the config file. Sorry for the trouble, and thanks.
finetune_novel_fp32.sh file:
#!/bin/bash
DATA_DIR="./data/novel/preprocessed_id/"
CHECKPOINT_PATH="./path_v2/to/CPM-distill"
RESULTS_DIR="results/"
MODEL_NAME="finetune-novel"
TOKENIZER_PATH="bpe_3w_new/"
MPSIZE=1
NLAYERS=6
NHIDDEN=2560
NATT=32
MAXSEQLEN=1024
CUR_PATH=$(realpath $0)
CUR_DIR=$(dirname ${CUR_PATH})
DS_CONFIG="${CUR_DIR}/../ds_config/ds_zero2_config_small.json"
python3 -m torch.distributed.launch --master_port ${1-1122} --nproc_per_node 1 finetune_text_generation.py \
--do_train \
--do_eval \
--data_dir ${DATA_DIR} \
--model-parallel-size ${MPSIZE} \
--num-layers ${NLAYERS} \
--hidden-size ${NHIDDEN} \
--load ${CHECKPOINT_PATH} \
--num-attention-heads ${NATT} \
--seq-length ${MAXSEQLEN} \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--tokenizer-path ${TOKENIZER_PATH} \
--vocab-size 30000 \
--lr 0.00001 \
--warmup 0.1 \
--batch-size 1 \
--deepspeed \
--deepspeed_config ${DS_CONFIG} \
--log-interval 10 \
--eval-interval 50 \
--seed 23333 \
--results_dir ${RESULTS_DIR} \
--model_name ${MODEL_NAME} \
--epoch 10 \
--checkpoint-activations \
--deepspeed-activation-checkpointing
Error message:
(liubiao2) kingsoft@k8s-w-10-13-84-7:~/liubiao2/smartWriter/CPM-1-Finetune$ bash scripts/novel/finetune_novel_fp32.sh
using world size: 1 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
[2021-12-09 10:11:40,637] [INFO] [checkpointing.py:734:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-12-09 10:11:40,637] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 26051 and data parallel seed: 23333
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6777/6777 [00:00<00:00, 15329.86it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385/385 [00:00<00:00, 35938.91it/s]
building GPT2 model ...
> number of parameters on model parallel rank 0: 551485440
26 50
Optimizer = FusedAdam
learning rate decaying linear
DeepSpeed is enabled.
[2021-12-09 10:11:48,665] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15, git-hash=unknown, git-branch=unknown
[2021-12-09 10:11:48,674] [INFO] [engine.py:605:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-09 10:11:48,674] [INFO] [engine.py:609:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-09 10:11:48,674] [INFO] [engine.py:619:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2021-12-09 10:11:48,674] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-09 10:11:48,674] [INFO] [stage2.py:101:__init__] Reduce bucket size 500000000
[2021-12-09 10:11:48,674] [INFO] [stage2.py:102:__init__] Allgather bucket size 500000000
[2021-12-09 10:11:48,674] [INFO] [stage2.py:103:__init__] CPU Offload: False
Using /home/kingsoft/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/kingsoft/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.28600168228149414 seconds
[2021-12-09 10:11:50,219] [INFO] [stage2.py:375:__init__] optimizer state initialized
[2021-12-09 10:11:50,219] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2021-12-09 10:11:50,219] [INFO] [engine.py:455:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-12-09 10:11:50,219] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <learning_rates.AnnealingLR object at 0x7efcedbc7978>
[2021-12-09 10:11:50,219] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-12-09 10:11:50,219] [INFO] [config.py:741:print] DeepSpeedEngine configuration:
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] activation_checkpointing_config {
"partition_activations": true,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] allreduce_always_fp32 ........ False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] amp_enabled .................. False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] amp_params ................... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] checkpoint_tag_validation_enabled True
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] checkpoint_tag_validation_fail False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] disable_allgather ............ False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] dump_state ................... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] dynamic_loss_scale_args ...... {'init_scale': 262144, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] elasticity_enabled ........... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] fp16_enabled ................. True
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] global_rank .................. 0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] gradient_accumulation_steps .. 2
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] gradient_clipping ............ 1.0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] gradient_predivide_factor .... 1.0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] initial_dynamic_scale ........ 262144
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] loss_scale ................... 0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] memory_breakdown ............. False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] optimizer_legacy_fusion ...... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] optimizer_name ............... None
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] optimizer_params ............. None
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] pld_enabled .................. False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] pld_params ................... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] prescale_gradients ........... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] scheduler_name ............... None
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] scheduler_params ............. None
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] sparse_attention ............. None
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] sparse_gradients_enabled ..... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] steps_per_print .............. 100
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] tensorboard_enabled .......... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] tensorboard_output_path ......
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] train_batch_size ............. 2
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] train_micro_batch_size_per_gpu 1
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] wall_clock_breakdown ......... True
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] world_size ................... 1
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_allow_untested_optimizer True
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_config .................. {
"stage": 2,
"contiguous_gradients": false,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": false,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+12,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false
}
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_enabled ................. True
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_optimization_stage ...... 2
[2021-12-09 10:11:50,221] [INFO] [config.py:752:print] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 2,
"steps_per_print": 100,
"zero_optimization": {
"stage": 2
},
"zero_allow_untested_optimizer": true,
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 18,
"hysteresis": 2,
"min_loss_scale": 1
},
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": true
}
Using /home/kingsoft/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00041604042053222656 seconds
[2021-12-09 10:11:50,222] [INFO] [engine.py:1464:_load_checkpoint] rank: 0 loading checkpoint: ./path_v2/to/CPM-distill/310000/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "finetune_text_generation.py", line 324, in <module>
main()
File "finetune_text_generation.py", line 208, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 510, in setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 281, in load_checkpoint
checkpoint_name, sd = model.load_checkpoint(args.load, iteration, load_module_strict=False, load_optimizer_states=False, load_lr_scheduler_states=False)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1440, in load_checkpoint
load_lr_scheduler_states=load_lr_scheduler_states)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1472, in _load_checkpoint
strict=load_module_strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1373, in load_module_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/distributed.py", line 90, in load_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for GPT2Model:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([15000, 768]) from checkpoint, the shape in current model is torch.Size([30000, 2560]).
size mismatch for position_embeddings.weight: copying a param with shape torch.Size([1024, 768]) from checkpoint, the shape in current model is torch.Size([1024, 2560]).
size mismatch for transformer.layers.0.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.0.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.1.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.1.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.1.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.2.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.2.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.2.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.3.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.3.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.3.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.4.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.4.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.4.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.5.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.5.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.5.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
Traceback (most recent call last):
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/kingsoft/anaconda3/envs/liubiao2/bin/python3', '-u', 'finetune_text_generation.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed_id/', '--model-parallel-size', '1', '--num-layers', '6', '--hidden-size', '2560', '--load', './path_v2/to/CPM-distill', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/scripts/novel/../ds_config/ds_zero2_config_small.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations', '--deepspeed-activation-checkpointing']' returned non-zero exit status 1.
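For reference, the shapes reported above can be checked directly against the checkpoint file named in the log. A minimal sketch, assuming the usual DeepSpeed layout in which the module weights sit under the "module" key of mp_rank_00_model_states.pt:
import torch

# Minimal sketch: print the parameter shapes stored in the distilled checkpoint.
# Assumes the DeepSpeed model-states layout with weights under the "module" key.
ckpt_path = "./path_v2/to/CPM-distill/310000/mp_rank_00_model_states.pt"
state = torch.load(ckpt_path, map_location="cpu")
module = state.get("module", state) if isinstance(state, dict) else state

for name, value in module.items():
    if hasattr(value, "shape"):
        print(name, tuple(value.shape))

# word_embeddings.weight being (15000, 768) rather than (30000, 2560) points to
# a hidden size of 768 and a vocabulary split across model-parallel partitions,
# and counting the transformer.layers.* entries gives the true layer count.
Comparing these shapes against the --num-layers, --hidden-size and --vocab-size arguments in the launch command makes the mismatch explicit.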
Hello, it's not convenient for me to debug this right now. This error means the model size is not configured correctly (the number of layers and hidden size do not match the checkpoint). You can try the script below:
#!/bin/bash
DATA_DIR="./data/novel/preprocessed_id/"
CHECKPOINT_PATH="/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill"
RESULTS_DIR="results/"
MODEL_NAME="finetune-novel"
TOKENIZER_PATH="bpe_3w_new/"
MPSIZE=1
NLAYERS=12
NHIDDEN=768
NATT=12
MAXSEQLEN=1024
CUR_PATH=$(realpath $0)
CUR_DIR=$(dirname ${CUR_PATH})
DS_CONFIG="${CUR_DIR}/../ds_config/ds_finetune_large_fp32.json"
python3 -m torch.distributed.launch --master_port ${1-1122} --nproc_per_node 1 finetune_text_generation.py \
--do_train \
--do_eval \
--data_dir ${DATA_DIR} \
--model-parallel-size ${MPSIZE} \
--num-layers ${NLAYERS} \
--hidden-size ${NHIDDEN} \
--load ${CHECKPOINT_PATH} \
--num-attention-heads ${NATT} \
--seq-length ${MAXSEQLEN} \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--tokenizer-path ${TOKENIZER_PATH} \
--vocab-size 30000 \
--lr 0.00001 \
--warmup 0.1 \
--batch-size 1 \
--deepspeed \
--deepspeed_config ${DS_CONFIG} \
--log-interval 10 \
--eval-interval 50 \
--seed 23333 \
--results_dir ${RESULTS_DIR} \
--model_name ${MODEL_NAME} \
--epoch 10 \
--checkpoint-activations \
--deepspeed-activation-checkpointing
and configure the DeepSpeed file ds_finetune_large_fp32.json according to https://github.com/TsinghuaAI/CPM-1-Distill/blob/main/configs/deepspeed/ds_zero2_config_small.json — just modify the corresponding parameters in ds_finetune_large_fp32.json.
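A minimal sketch of that last step, assuming ds_finetune_large_fp32.json uses the same keys as the json dump printed in the log above; the values below are placeholders, and the real ones should be taken from the linked ds_zero2_config_small.json:
import json

# Hypothetical path, following the DS_CONFIG variable in the script above.
cfg_path = "scripts/ds_config/ds_finetune_large_fp32.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Placeholder edits; replace with the values from ds_zero2_config_small.json.
cfg["train_micro_batch_size_per_gpu"] = 1
cfg["gradient_accumulation_steps"] = 2
cfg.setdefault("fp16", {})["enabled"] = False  # fp32 fine-tuning, per the script name

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)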
It still doesn't work. I have tried many parameter combinations and all of them fail. I'm planning to give up on CPM: the model is too large, and text generation is too slow.
Could you recommend a few models for novel text generation? I'm not very familiar with this area. Thanks.
-
Then it's possible that the computation in the distilled model was changed as well. You could run it with the code from the https://github.com/TsinghuaAI/CPM-1-Distill repo, and combine that with https://github.com/zhenhao-huang/CPM-1-Finetune-Text-Generation/blob/main/finetune_text_generation.py to modify the corresponding text generation template.
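As a purely hypothetical illustration of what adapting the text generation template could look like (the names below are not taken from finetune_text_generation.py):
# Hypothetical sketch only: how a prompt template for novel continuation might
# be assembled before tokenization; adapt it to the actual preprocessing code.
def build_novel_prompt(title: str, context: str) -> str:
    # Concatenate a simple conditioning prefix with the story context.
    return f"标题：{title}\n正文：{context}"

def encode_sample(tokenizer, title: str, context: str, max_length: int = 1024):
    prompt = build_novel_prompt(title, context)
    ids = tokenizer.encode(prompt)  # assumes the tokenizer exposes encode()
    return ids[:max_length]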
-
The repo https://github.com/Morizeyao/GPT2-Chinese can also be used for text generation, although the current trend is that larger models generally produce better results.
OK, thanks a lot.