
[BUG/Help] <how to train instances with long prompt>

Open · HANiFLY opened this issue 1 year ago · 1 comment

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

The current model truncates text that exceeds the maximum length, but in text summarization part of the label (the target output) may actually correspond to the truncated portion of the input. How can this be solved, i.e. how can the full text be kept for training?
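A minimal illustration of the mismatch being described (plain Python, not the ChatGLM-6B preprocessing code; the token lists and length limit are made up for the example): when the source is truncated to a fixed length, the summary label can still mention content that was cut away, so the model is trained to "hallucinate" that content.

```python
# Hypothetical limit on source tokens (real limits come from the
# training script's max_source_length argument).
max_source_length = 8

source_tokens = ["the", "report", "covers", "q1", "sales", "and",
                 "finally", "discusses", "the", "q2", "forecasts"]
label_tokens = ["q2", "forecasts"]  # the summary refers to the tail

# Truncation keeps only the first max_source_length tokens.
truncated = source_tokens[:max_source_length]

# Tokens the label mentions but the model never sees in the input:
missing = [t for t in label_tokens if t not in truncated]
print(missing)  # → ['q2', 'forecasts']
```

Here the label is supervised against content the truncated input no longer contains, which is exactly the training-signal problem raised above.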

Expected Behavior

No response

Steps To Reproduce

Train with a long prompt of around 10,000 tokens.

Environment

- OS: Ubuntu 18.04
- Python: 3.9
- Transformers: 4.27.1
- PyTorch: 2.0.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True

Anything else?

No response

HANiFLY avatar Apr 26 '23 03:04 HANiFLY

Not sure, but can you do this by tuning `max_source_length`/`max_target_length`/`pre_seq_len`, if you have enough VRAM?
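If that route works, it would look something like the following. This is only a sketch based on the p-tuning example in the ChatGLM-6B repo; the file paths, output directory, and the specific length values are illustrative, and raising `max_source_length` increases activation memory, so whether it fits depends on your VRAM.

```shell
# Illustrative invocation of the repo's ptuning/main.py with longer
# sequence limits; adjust paths and values to your setup.
python main.py \
    --do_train \
    --train_file train.json \
    --model_name_or_path THUDM/chatglm-6b \
    --max_source_length 4096 \
    --max_target_length 512 \
    --pre_seq_len 128 \
    --per_device_train_batch_size 1 \
    --output_dir output/long-prompt
```

Note that even with larger limits, inputs beyond `max_source_length` are still truncated, so a 10,000-token prompt would need the limit raised at least that far (with the corresponding memory cost).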

yc-huang avatar Apr 28 '23 05:04 yc-huang