
LoRA fine-tuning training error

Open SerMs opened this issue 1 year ago • 12 comments

Sorry to take up your valuable time. Could you please help me with a few questions?

  1. Following the fine-tuning instructions, I installed WSL and Ubuntu from the Microsoft Store, but when I start training I get the error shown below: (screenshot)
  2. Is it possible not to install Ubuntu on Windows? Would Ubuntu installed in a VM work instead?
  3. Also, which prompt format should the training dataset use? (screenshot)

SerMs · Jan 11 '24 06:01

  1. Type `wsl` in the Windows console and look at the error it reports, then search the web for a fix. Usually you need to install a specific component or enable the virtualization feature.

  2. On Windows you must install Ubuntu through WSL, because WSL supports GPU passthrough.

  3. If you want to train on a novel, put the whole novel's text as a single line, like the first row in your screenshot. If you want to train dialogue, write it in dialogue format like the third row. If you want to train instructions or code, use the format of the fourth and fifth rows. In short, provide data in the same form in which you want the LoRA-finetuned model to be used. Also, it is a good idea to mix in some data of other kinds that are unrelated to your fine-tuning goal (e.g. dialogue, continuation) to keep the model from overfitting and getting dumber. A small example file is sketched below.
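For reference, here is a minimal, hypothetical sketch of what such a JSONL training file could look like, assuming the json2binidx_tool workflow that RWKV-Runner uses (one JSON object per line, with the whole sample in a "text" field). The exact chat and instruction prefixes are assumptions modeled on the kinds of rows described above, so check them against the bundled sample data before converting:

```python
# Hypothetical example only: writes a tiny train.jsonl with the three sample
# shapes described above (continuation/novel, dialogue, instruction).
# The "text" field and the prefixes below are assumptions; verify them against
# the sample file shipped with RWKV-Runner before running json2binidx_tool.
import json

samples = [
    # 1) continuation / novel: the whole text as one sample
    {"text": "Chapter 1. The rain had not stopped for three days..."},
    # 2) dialogue-style sample
    {"text": "User: What is RWKV?\n\nAssistant: RWKV is an RNN-style language model "
             "with transformer-level performance."},
    # 3) instruction-style sample (the shape the base model was tuned on)
    {"text": "Instruction: Translate the sentence into English.\n\n"
             "Input: 今天天气不错\n\nResponse: The weather is nice today."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

Mixing a few samples of each kind, as suggested above, only takes a few extra lines in the same file.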

josStorer · Jan 11 '24 07:01

`{ "Instruction": "question", "Input": "background knowledge", "Response": "answer" }`

Would this format work?

SerMs · Jan 12 '24 02:01

How can I check the training progress and results? (screenshot)

SerMs · Jan 12 '24 02:01

> `{ "Instruction": "question", "Input": "background knowledge", "Response": "answer" }`
>
> Would this format work?

You can't use that format. Look at the fourth row: that is the only instruction format the base model supports, so LoRA fine-tuning with your format won't work well. I suggest either doing full fine-tuning, or following the format of the fourth row (a rough conversion sketch is given at the end of this comment).

Your screenshot above shows the training dependencies being installed; once installation finishes and training starts, the progress will be displayed.
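If you already have data in the {"Instruction", "Input", "Response"} shape, one option is to flatten each record into a single instruction-style text sample, similar to the fourth row of the sample data. This is only a rough sketch under that assumption; the field names, prefixes and file names here are illustrative, not the tool's actual API:

```python
# Rough sketch: flatten {"Instruction", "Input", "Response"} records (one JSON
# object per line in qa.jsonl) into instruction-style "text" samples for
# json2binidx_tool. All names and separators here are assumptions.
import json

def to_instruction_text(record: dict) -> dict:
    parts = [f"Instruction: {record['Instruction'].strip()}"]
    if record.get("Input", "").strip():
        parts.append(f"Input: {record['Input'].strip()}")
    parts.append(f"Response: {record['Response'].strip()}")
    return {"text": "\n\n".join(parts)}

with open("qa.jsonl", encoding="utf-8") as src, \
     open("train.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(json.dumps(to_instruction_text(json.loads(line)),
                                 ensure_ascii=False) + "\n")
```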

josStorer · Jan 12 '24 02:01

How long does training usually take? I'm only doing a test run, using the 0.1B model.

SerMs · Jan 12 '24 02:01

(screenshot) It's stuck here and not moving. Is it still loading, or has it already stopped? I don't quite understand what this step is doing.

SerMs · Jan 12 '24 03:01

Click Train again and see whether "gcc installed; requirements satisfied" appears.

josStorer · Jan 12 '24 04:01

Building dependency tree...
Reading state information...
Package gcc is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  gcc-11-doc gcc-9-doc gcc-12-doc gcc-10-doc
E: Package 'gcc' has no installation candidate
pip installed
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package ninja-build
--2024-01-12 13:59:55-- https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.download.nvidia.cn/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin [following]
--2024-01-12 14:00:11-- https://developer.download.nvidia.cn/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
Resolving developer.download.nvidia.cn (developer.download.nvidia.cn)... 175.4.58.178, 175.4.58.179, 175.4.58.180, ...
Connecting to developer.download.nvidia.cn (developer.download.nvidia.cn)|175.4.58.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-wsl-ubuntu.pin’
0K 100% 26.6M=0s
2024-01-12 14:00:12 (26.6 MB/s) - ‘cuda-wsl-ubuntu.pin’ saved [190/190]
--2024-01-12 14:00:12-- https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.download.nvidia.cn/compute/cuda/12.2.0/local_installers/cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb [following]
--2024-01-12 14:00:16-- https://developer.download.nvidia.cn/compute/cuda/12.2.0/local_installers/cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb
Resolving developer.download.nvidia.cn (developer.download.nvidia.cn)... 175.4.58.178, 175.4.58.179, 175.4.58.180, ...
Connecting to developer.download.nvidia.cn (developer.download.nvidia.cn)|175.4.58.178|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb’ not modified on server. Omitting download.
(Reading database ... 25419 files and directories currently installed.)
Preparing to unpack cuda-repo-wsl-ubuntu-12-2-local_12.2.0-1_amd64.deb ...
Unpacking cuda-repo-wsl-ubuntu-12-2-local (12.2.0-1) over (12.2.0-1) ...
Setting up cuda-repo-wsl-ubuntu-12-2-local (12.2.0-1) ...
Reading package lists...
E: Could not get lock /var/lib/apt/lists/lock. It is held by process 979 (apt-get)
E: Unable to lock directory /var/lib/apt/lists/
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package cuda
requirements satisfied
loading models/RWKV-5-World-1B5-v2-20231025-ctx4096.pth
v5/train.py --vocab_size 65536 --n_layer 24 --n_embd 2048
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpw37_4auo
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpw37_4auo/_remote_module_non_scriptable.py
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
[2024-01-12 14:01:48,105] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:pytorch_lightning.utilities.rank_zero:
############################################################################

RWKV-5 BF16 on 1x1 GPU, bsz 1x1x1=1, deepspeed_stage_2 with grad_cp

Data = ./finetune/json2binidx_tool/data/test2_text_document (binidx), ProjDir = lora-models

Epoch = 0 to 19, save every 1 epoch

Each "epoch" = 200 steps, 200 samples, 204800 tokens

Model = 24 n_layer, 2048 n_embd, 1024 ctx_len

Adam = lr 5e-05 to 5e-05, warmup 0 steps, beta (0.9, 0.999), eps 1e-08

Found torch 1.13.1+cu117, recommend 1.13.1+cu117 or newer

Found deepspeed 0.11.2, recommend 0.7.0 (faster than newer versions)

Found pytorch_lightning 1.9.5, recommend 1.9.5

############################################################################ INFO:pytorch_lightning.utilities.rank_zero:{'load_model': 'models/RWKV-5-World-1B5-v2-20231025-ctx4096.pth', 'wandb': '', 'proj_dir': 'lora-models', 'random_seed': -1, 'data_file': './finetune/json2binidx_tool/data/test2_text_document', 'data_type': 'binidx', 'vocab_size': 65536, 'ctx_len': 1024, 'epoch_steps': 200, 'epoch_count': 20, 'epoch_begin': 0, 'epoch_save': 1, 'micro_bsz': 1, 'n_layer': 24, 'n_embd': 2048, 'dim_att': 2048, 'dim_ffn': 7168, 'pre_ffn': 1, 'head_qk': 1, 'tiny_att_dim': 0, 'tiny_att_layer': -999, 'lr_init': 5e-05, 'lr_final': 5e-05, 'warmup_steps': 0, 'beta1': 0.9, 'beta2': 0.999, 'adam_eps': 1e-08, 'grad_cp': 1, 'dropout': 0, 'weight_decay': 0, 'weight_decay_final': -1, 'my_pile_version': 1, 'my_pile_stage': 0, 'my_pile_shift': -1, 'my_pile_edecay': 0, 'layerwise_lr': 1, 'ds_bucket_mb': 200, 'my_sample_len': 0, 'my_ffn_shift': 1, 'my_att_shift': 1, 'head_size_a': 64, 'head_size_divisor': 8, 'my_pos_emb': 0, 'load_partial': 0, 'magic_prime': 0, 'my_qa_mask': 0, 'my_random_steps': 0, 'my_testing': '', 'my_exit': 99999999, 'my_exit_tokens': 0, 'emb': False, 'lora': True, 'lora_load': '', 'lora_r': 8, 'lora_alpha': 32.0, 'lora_dropout': 0.01, 'lora_parts': 'att,ffn,time,ln', 'logger': False, 'enable_checkpointing': False, 'default_root_dir': None, 'gradient_clip_val': 1.0, 'gradient_clip_algorithm': None, 'num_nodes': 1, 'num_processes': None, 'devices': '1', 'gpus': None, 'auto_select_gpus': None, 'tpu_cores': None, 'ipus': None, 'enable_progress_bar': True, 'overfit_batches': 0.0, 'track_grad_norm': -1, 'check_val_every_n_epoch': 100000000000000000000, 'fast_dev_run': False, 'accumulate_grad_batches': 8, 'max_epochs': 20, 'min_epochs': None, 'max_steps': -1, 'min_steps': None, 'max_time': None, 'limit_train_batches': None, 'limit_val_batches': None, 'limit_test_batches': None, 'limit_predict_batches': None, 'val_check_interval': None, 'log_every_n_steps': 100000000000000000000, 'accelerator': 'gpu', 'strategy': 'deepspeed_stage_2', 'sync_batchnorm': False, 'precision': 'bf16', 'enable_model_summary': True, 'num_sanity_val_steps': 0, 'resume_from_checkpoint': None, 'profiler': None, 'benchmark': None, 'reload_dataloaders_every_n_epochs': 0, 'auto_lr_find': False, 'replace_sampler_ddp': False, 'detect_anomaly': False, 'auto_scale_batch_size': False, 'plugins': None, 'amp_backend': None, 'amp_level': None, 'move_metrics_to_cpu': False, 'multiple_trainloader_mode': 'max_size_cycle', 'inference_mode': True, 'my_timestamp': '2024-01-12-14-01-50', 'betas': (0.9, 0.999), 'real_bsz': 1, 'run_name': '65536 ctx1024 L24 D2048'} Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... 
RWKV_MY_TESTING
Traceback (most recent call last):
  File "/mnt/d/LS/Rwkv/./finetune/lora/v5/train.py", line 308, in
    from src.trainer import train_callback, generate_init_weight
  File "/mnt/d/LS/Rwkv/finetune/lora/v5/src/trainer.py", line 6, in
    from .model import LORA_CONFIG
  File "/mnt/d/LS/Rwkv/finetune/lora/v5/src/model.py", line 56, in
    wkv5_cuda = load(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1597, in _write_ninja_file_and_build_library
    get_compiler_abi_compatibility_and_version(compiler)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 336, in get_compiler_abi_compatibility_and_version
    if not check_compiler_ok_for_platform(compiler):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 290, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.

"gcc installed; requirements satisfied" did not appear.

> Click Train again and see whether "gcc installed; requirements satisfied" appears.

SerMs · Jan 12 '24 06:01

A few of the dependency requests show 443; could that be the cause? Do I need to use a proxy?

SerMs · Jan 12 '24 06:01

In WSL, run `sudo apt update`.

josStorer · Jan 12 '24 06:01

LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.8.ffn.key
LoRA additionally training module blocks.8.ffn.receptance
INFO:pytorch_lightning.utilities.rank_zero:########## Loading models/RWKV-5-World-0.1B-v1-20230803-ctx4096.pth... ##########
LoRA additionally training module blocks.8.ffn.value
LoRA additionally training module blocks.9.ln1
LoRA additionally training module blocks.9.ln2
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_v
LoRA additionally training parameter time_mix_r
LoRA additionally training parameter time_mix_g
LoRA additionally training parameter time_decay
LoRA additionally training parameter time_faaaa
LoRA additionally training module blocks.9.att.receptance
LoRA additionally training module blocks.9.att.key
LoRA additionally training module blocks.9.att.value
LoRA additionally training module blocks.9.att.gate
LoRA additionally training module blocks.9.att.ln_x
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.9.ffn.key
LoRA additionally training module blocks.9.ffn.receptance
LoRA additionally training module blocks.9.ffn.value
LoRA additionally training module blocks.10.ln1
LoRA additionally training module blocks.10.ln2
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_v
LoRA additionally training parameter time_mix_r
LoRA additionally training parameter time_mix_g
LoRA additionally training parameter time_decay
LoRA additionally training parameter time_faaaa
LoRA additionally training module blocks.10.att.receptance
LoRA additionally training module blocks.10.att.key
LoRA additionally training module blocks.10.att.value
LoRA additionally training module blocks.10.att.gate
LoRA additionally training module blocks.10.att.ln_x
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.10.ffn.key
LoRA additionally training module blocks.10.ffn.receptance
LoRA additionally training module blocks.10.ffn.value
LoRA additionally training module blocks.11.ln1
LoRA additionally training module blocks.11.ln2
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_v
LoRA additionally training parameter time_mix_r
LoRA additionally training parameter time_mix_g
LoRA additionally training parameter time_decay
LoRA additionally training parameter time_faaaa
LoRA additionally training module blocks.11.att.receptance
LoRA additionally training module blocks.11.att.key
LoRA additionally training module blocks.11.att.value
LoRA additionally training module blocks.11.att.gate
LoRA additionally training module blocks.11.att.ln_x
LoRA additionally training parameter time_mix_k
LoRA additionally training parameter time_mix_r
LoRA additionally training module blocks.11.ffn.key
LoRA additionally training module blocks.11.ffn.receptance
LoRA additionally training module blocks.11.ffn.value
Traceback (most recent call last):
  File "/mnt/d/LS/Rwkv/./finetune/lora/v5/train.py", line 379, in
    model.load_state_dict(load_dict, strict=(not args.lora))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RWKV:
  size mismatch for blocks.0.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.0.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.1.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.1.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.1.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.2.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.2.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.2.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.3.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.3.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.3.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.4.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.4.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.4.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.5.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.5.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.5.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.6.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.6.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.6.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.7.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.7.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.7.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.8.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.8.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.8.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.9.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.9.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.9.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.10.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.10.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.10.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).
  size mismatch for blocks.11.att.time_decay: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([12, 64]).
  size mismatch for blocks.11.ffn.key.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([2688, 768]).
  size mismatch for blocks.11.ffn.value.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([768, 2688]).

I tried again following your instructions, but it got stuck again; it has been sitting here for half an hour with no activity. Is this an error, or what is going on?

SerMs · Jan 12 '24 08:01

Try RWKV5-1.5B. The small RWKV5 checkpoints may not be adapted for LoRA yet; RWKV5 went through several sub-versions.
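In case it helps to confirm what a checkpoint actually contains, a quick generic check is to load it on the CPU and print a few tensor shapes; the mismatches in the log above (ffn.key 3072×768 in the checkpoint vs 2688×768 expected, time_decay [12] vs [12, 64]) would show up directly. This is only a sketch; the file and parameter names are taken from the log as examples:

```python
# Quick sanity check (sketch): print a few parameter shapes from a checkpoint
# before starting LoRA training, to see whether they match what the trainer expects.
import torch

state = torch.load("models/RWKV-5-World-0.1B-v1-20230803-ctx4096.pth", map_location="cpu")
for name in ("blocks.0.ffn.key.weight", "blocks.0.ffn.value.weight", "blocks.1.att.time_decay"):
    if name in state:
        print(name, tuple(state[name].shape))
```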

josStorer · Jan 12 '24 11:01