
GLM-4's fine-tuning code does not appear to optimize memory usage during evaluation

Open FloatFrank opened this issue 4 months ago • 0 comments

System Info / 系統信息

Latest LoRA fine-tuning code, all parameters at default except batch size. With the evaluation batch_size set to 1 and the training batch_size also 1, training runs fine (even at batch size 2 or above), but evaluation fails with a paging-file-size error. Does the loop need to be train -> GC -> eval -> train -> ...? On a 2080 Ti 22G there simply is not enough VRAM to add evaluation on top of training.
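A minimal sketch of the train -> GC -> eval cycle proposed above. This only illustrates the control flow with stdlib `gc`; in a real run, `torch.cuda.empty_cache()` (shown commented out, since it requires a CUDA build of PyTorch) would additionally release cached GPU blocks before each evaluation pass. The function and step names here are hypothetical, not part of `finetune.py`:

```python
import gc

def run_cycles(num_cycles):
    """Alternate training and evaluation, forcing garbage collection in between."""
    log = []
    for step in range(num_cycles):
        log.append(f"train step {step}")   # training step would run here
        gc.collect()                       # free Python-level garbage before eval
        # torch.cuda.empty_cache()         # would also release cached CUDA memory
        log.append(f"eval step {step}")    # evaluation pass would run here
    return log

print(run_cycles(2))
```

Whether this actually helps depends on whether the evaluation OOM comes from cached allocator blocks (which `empty_cache()` can reclaim) or from live tensors held by the trainer, which GC cannot free.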

Error (eval deliberately triggered at step 10 to reproduce):

{'loss': 2.4602, 'grad_norm': 6.9103498458862305, 'learning_rate': 0.0004991666666666666, 'epoch': 0.0}
  0%|▎ | 10/6000 [01:03<2:28:24, 1.49s/it]
***** Running Evaluation *****
  Num examples = 600
  Batch size = 1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "E:\Conda Envs\glm-4-demo\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Conda Envs\glm-4-demo\Lib\multiprocessing\spawn.py", line 131, in _main
    prepare(preparation_data)
  File "E:\Conda Envs\glm-4-demo\Lib\multiprocessing\spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "E:\Conda Envs\glm-4-demo\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "...\fine_tune\glm_lora_change\GLM-4-MineTuningVersion\finetune_demo\finetune.py", line 11, in <module>
    import torch
  File "E:\Conda Envs\glm-4-demo\Lib\site-packages\torch\__init__.py", line 137, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "E:\Conda Envs\glm-4-demo\Lib\site-packages\torch\lib\cublas64_12.dll" or one of its dependencies.
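The traceback shows the failure happening inside `multiprocessing.spawn` while re-importing `torch`: on Windows, worker processes are started with the "spawn" method, so each worker re-executes the main script and re-loads torch's CUDA DLLs, which is what exhausts the paging file. A minimal sketch of the standard guard (the `main()` body here is a hypothetical stand-in for the training logic in `finetune.py`):

```python
# On Windows, "spawn" re-imports the entry script in every worker process.
# Guarding the entry point keeps DataLoader workers from re-running the
# training setup (and re-loading torch's CUDA DLLs) on import.
import multiprocessing

def main():
    # finetune.py's training/evaluation logic would go here
    return "trained"

if __name__ == "__main__":
    multiprocessing.freeze_support()  # no-op except in frozen Windows builds
    print(main())
```

Setting `dataloader_num_workers` to 0 (so no worker processes are spawned), or enlarging the Windows paging file, are the usual mitigations when the guard alone is not enough; whether either applies here depends on the script's existing structure.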

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • [ ] The official example scripts / 官方的示例脚本
  • [ ] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

python .\finetune.py data//nov_glm_datasets THUDM/glm-4-9b-chat configs/lora.yaml

Expected behavior / 期待表现

Fine-tuning should run normally, without loading an extra copy of the model for inference at eval time.

FloatFrank · Oct 11 '24 08:10