ChatGLM-Finetuning issues

Results: 67 issues

Error while running train.py: exits with return code = -9

Hi, when I run multi-GPU training with the command ``` CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \ --train_path data/spo_0.json \ --model_name_or_path ChatGLM-6B/ \ --per_device_train_batch_size 1 \ --max_len 1560 \ --max_src_len 1024 \ --learning_rate 1e-4 \ --weight_decay 0.1 \ --num_train_epochs...

yhx0105
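
As background for the `-9` in the title: under the convention used by Python's `subprocess` module, a negative return code means the child process was terminated by the signal with that number, and signal 9 is SIGKILL on Linux. The sketch below only decodes the number; the helper name `describe_return_code` is hypothetical. SIGKILL is often sent by the kernel's out-of-memory killer when host RAM runs out, but that cause is an assumption here and is not confirmed by the issue text.

```python
import signal


def describe_return_code(returncode: int) -> str:
    """Explain a subprocess-style return code (hypothetical helper)."""
    if returncode < 0:
        # Negative values mean "terminated by signal number -returncode".
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with error status {returncode}"


print(describe_return_code(-9))
```

If the out-of-memory killer is responsible, the kernel log (for example via `dmesg`) usually records which process was killed.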

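Discussion of RuntimeError: element 0 or 1 of tensors does not require grad and does not have a grad_fn

This issue concerns the pipeline-parallel approach in v0.1. While using the pipeline model, I ran into the two problems named in the title (the "element 0" and "element 1" variants). The "element 1" problem is fairly easy to solve: as the comment below says, do not create new leaf tensors during forward. But in this repo's code, the embedding layer sets variables such as the mask and passes them on to other parts of the model, so why does that not raise an error? The "element 0" problem mainly appears when activation checkpointing is enabled: whenever activation_checkpoint_interval is set > 0 in `model_pipe = PipelineModule(layers=get_model(model), num_stages=args.num_stages,partition_method = 'parameters',activation_checkpoint_interval=1)`, an error is raised. My guess is that some environment switch is not turned on, or that the way the get_model function is constructed is problematic?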

karots123

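Is resuming training from a checkpoint supported, for both LoRA and full-parameter fine-tuning?

For example, other repos add a checkpoint_dir argument to the launch script that points to the checkpointed model path, then continue training from there.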

lianglinyi

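Where is the code for the classification task?

@liucongg Where is the code for the classification task? I could not find it in the repository. Any pointers would be appreciated!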

Alan-JW

Is 1560 the minimum input sequence length for training?

Hi @liucongg, if I understand correctly, the code below line 123 pads the batched sentences to at least 1560, which is quite different from batch inference. During inference, we usually...

Shuai-Xie
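
The contrast raised in this question can be illustrated with a minimal sketch: padding every sequence to a fixed `max_len` versus padding only to the longest sequence in the batch. This is not the repository's code. The function names, `PAD_ID = 0`, and right-side padding are all assumptions made for illustration.

```python
from typing import List

PAD_ID = 0  # hypothetical pad token id, chosen only for this sketch


def pad_to_fixed(batch: List[List[int]], max_len: int, pad_id: int = PAD_ID) -> List[List[int]]:
    """Truncate or right-pad every sequence to exactly max_len."""
    padded = []
    for seq in batch:
        seq = seq[:max_len]
        padded.append(seq + [pad_id] * (max_len - len(seq)))
    return padded


def pad_to_longest(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Right-pad every sequence to the longest length found in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


batch = [[5, 6, 7], [8, 9]]
fixed = pad_to_fixed(batch, max_len=1560)
dynamic = pad_to_longest(batch)
print(len(fixed[0]), len(dynamic[0]))
```

With fixed padding, every batch costs as much as a batch of `max_len`-length sequences. With per-batch padding, short batches stay short.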

Is this line wrong?

https://github.com/liucongg/ChatGLM-Finetuning/blob/d52202a2facfa1dff45b0daec7e56aa54c126616/utils.py#L95-L96 I am not sure, but this seems to be the correct code: ` input_ids = tokenizer.convert_tokens_to_ids(tokens) + [tokenizer.get_command("[gMASK]"), tokenizer.get_command("sop")] `

iridescentee

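A question about the Dataloader Sampler in Pipeline Parallelism

Hi, your pipeline parallelism wrapper is well written, thanks! I have a question about the dataloader's sampler. Normally, when we launch distributed training we use DistributedSampler, so that each process reads a different shard of the dataset. In your code I see that you directly set `shuffle=True`. Does that mean the N processes here actually read the same data, with no sharding? Or does it not matter if they differ, because only the data from process `0` goes through the whole pipeline and the data from the other processes is ignored? I could not find the relevant source code in DeepSpeed, nor anything in the corresponding tutorial. My understanding is that naive model parallelism launches only one process, so the question does not arise, but DeepSpeed by default launches N processes for N GPUs, which is rather puzzling... Have you looked into this? Thanks! Best, Fangkai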

SparkJiao
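
For readers unfamiliar with the sharding this question refers to, here is a minimal pure-Python sketch of a strided split, in which rank `r` of `world_size` processes takes indices `r, r + world_size, ...`. The helper `shard_indices` is hypothetical. It mirrors the general idea behind a distributed sampler but omits shuffling and the padding that makes shards equal in length. It says nothing about how DeepSpeed's pipeline engine feeds data, which the issue leaves open.

```python
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the dataset indices assigned to one rank under a strided split."""
    return list(range(num_samples))[rank::world_size]


# Four processes splitting a ten-sample dataset.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Under this scheme no two ranks read the same sample. With a plain `shuffle=True` loader in every process, nothing enforces that separation.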

Why is model.train() called when saving the model?

Could the author explain what each .py file does?

A question about the GLM1 Tokenizer
