Training loss curve on V2
You updated to the V2 version, but it only supports a batch size of 1. How much more compute does it cost compared to V1? I would also like to see the loss curve, since I have started training it with only 1 A100 and the loss for the AE task is around ~2.xx.
In pretraining, using the Wikipedia dataset with a learning rate of 1e-4 can help the model jump out of the local optimum, and the loss can be reduced to 0.xx.
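For concreteness, a minimal sketch of that setup using the Hugging Face `datasets` and `transformers` APIs; the Wikipedia config name and every hyperparameter other than the 1e-4 learning rate are placeholder assumptions, not the authors' exact values:

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Illustrative only: a Wikipedia dump from the Hub (the config name is an assumption)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# Learning rate 1e-4 as mentioned above; all other values are placeholders
training_args = TrainingArguments(
    output_dir="./icae-pretrain",
    learning_rate=1e-4,
    per_device_train_batch_size=1,   # V2 reportedly supports only batch size 1
    gradient_accumulation_steps=32,  # placeholder to recover a larger effective batch
    bf16=True,
    logging_steps=50,
)
```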
@Zyvpeng thanks for the response. I see in the paper you mention that the loss is under 0.1, so 0.xx here should really be 0.0x? I have made progress in training it, but the loss only reaches about 0.4x.
@toilaluan When I train ICAE with lm_ratio=0, the loss can reach under 0.1. However, when I set lm_ratio=0.4, I face the same problem as you, so what's your lm_ratio? By the way, did the author mention anything about lm_ratio? Thank you!
It is written in the paper; the recommendation is to set it between 0.4 and 0.6.
pretrain_dataset = train_dataset.map(pretrain_tokenize_function, batched=True, batch_size=1, fn_kwargs={"model": model, "mem": MEM_TOKENS, "lm_ratio": training_args.lm_ratio})
After obtaining the data with this call, the number of examples displayed during training is not the same as the original dataset size. Is this the case for you? My progress bar shows 25,000.
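One possible explanation (an assumption about the cause, not something the ICAE code confirms): with `batched=True`, a map function is allowed to return a different number of rows than it receives, so `len(pretrain_dataset)` can differ from `len(train_dataset)`. A self-contained toy sketch of that behaviour; `toy_tokenize_function` and the data are made up for illustration:

```python
from datasets import Dataset

# Toy dataset standing in for train_dataset (names are illustrative)
train_dataset = Dataset.from_dict({"text": ["a b c d", "e f", "g h i"]})

def toy_tokenize_function(batch):
    # A batched map function may emit more or fewer rows than it receives,
    # e.g. when long texts are split into several chunks or short ones are dropped.
    out = {"tokens": []}
    for text in batch["text"]:
        words = text.split()
        # Split into chunks of 2 "tokens": one input row can become several output rows
        out["tokens"].extend([words[i:i + 2] for i in range(0, len(words), 2)])
    return out

pretrain_dataset = train_dataset.map(
    toy_tokenize_function,
    batched=True,
    batch_size=1,
    remove_columns=train_dataset.column_names,
)

print(len(train_dataset), len(pretrain_dataset))  # 3 vs. 5: the row count changed
```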
Hi folks, do you have anything to share? I am also trying to reproduce pretraining, but it seems to be very slow (too slow). @getao you mentioned you used "tens of billions of tokens"; can you also share approximate training times (e.g., number of days)?
@Kirili4ik There is a trick: setting position identifiers for the soft tokens can make the model converge more quickly.
https://arxiv.org/pdf/2409.14364v2
Example:
- context positions: [1,2,3,4,5,6,7,8]
- ICAE soft-token positions: [9, 10, 11, 12] (example with 4 soft tokens)
- better soft-token positions: [1, 3, 6, 9]
For RoPE, attention weights decay exponentially with respect to the relative token positions, specified by the position identifiers.
I was trying to play around with this here: https://github.com/condenses/condense-trainer/blob/main/compress/modeling/causal_lm.py
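Here is a minimal sketch of that position-id trick, assuming a RoPE-based decoder that accepts a `position_ids` argument (as Llama-style models in `transformers` do); the even-spacing rule below is just one way to reproduce the [1, 3, 6, 9] example above and may differ from the paper's exact scheme:

```python
import torch

def soft_token_position_ids(context_len: int, num_soft_tokens: int) -> torch.Tensor:
    """Position ids for [context tokens + soft/memory tokens].

    Instead of appending the soft tokens after the context
    (positions context_len+1, ..., context_len+num_soft_tokens),
    spread their positions across the context range so that, under RoPE,
    each soft token stays close in relative position to a different part
    of the context it is meant to summarize.
    """
    context_pos = torch.arange(1, context_len + 1)
    # Evenly spaced over [1, context_len + 1]; e.g. 8 context / 4 soft tokens -> [1, 3, 6, 9]
    soft_pos = torch.linspace(1, context_len + 1, num_soft_tokens).floor().long()
    return torch.cat([context_pos, soft_pos]).unsqueeze(0)  # shape (1, seq_len)

# Example matching the numbers above: 8 context tokens, 4 soft tokens
pos_ids = soft_token_position_ids(context_len=8, num_soft_tokens=4)
print(pos_ids)
# tensor([[1, 2, 3, 4, 5, 6, 7, 8, 1, 3, 6, 9]])

# These ids would be passed to a RoPE-based model alongside the embeddings, e.g.:
# outputs = model(inputs_embeds=embeds, position_ids=pos_ids)
```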
Wow @toilaluan thanks for the paper!!! The code at your link does not open, though. Maybe there is a misspelling, or is it private?
Thank you for your attention to our work. You can check out this link. https://arxiv.org/abs/2409.14364