Training loss curve on V2
You updated to the V2 version, but it only supports a batch size of 1. How much more compute does it cost compared to V1? I would also like to see the loss curve, since I have started training it with only 1 A100 and the loss for the AE task is around ~2.xx.
In pretraining, using the Wikipedia dataset with a learning rate of 1e-4 can help the model jump out of the local optimum, and the loss can be reduced to 0.xx.
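For concreteness, a minimal sketch of that setup using the Hugging Face `datasets` and `transformers` APIs; the Wikipedia config name and every hyperparameter other than the 1e-4 learning rate are placeholder assumptions, not the authors' exact values:

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Illustrative only: a Wikipedia dump from the Hub (the config name is an assumption)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# Learning rate 1e-4 as mentioned above; all other values are placeholders
training_args = TrainingArguments(
    output_dir="./icae-pretrain",
    learning_rate=1e-4,
    per_device_train_batch_size=1,   # V2 reportedly supports only batch size 1
    gradient_accumulation_steps=32,  # placeholder to recover a larger effective batch
    bf16=True,
    logging_steps=50,
)
```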
@Zyvpeng thanks for the response. I see in the paper you mention that the loss is under 0.1, so 0.xx here should really be 0.0x? I have made progress in training it, but the loss only reaches about 0.4x.
@toilaluan When I train ICAE with lm_ratio=0, the loss can reach under 0.1. However, when I set lm_ratio=0.4, I face the same problem as you, so what's your lm_ratio? By the way, did the author mention anything about lm_ratio? Thank you!
It is written in the paper; the recommendation is to set it between 0.4 and 0.6.
pretrain_dataset = train_dataset.map(pretrain_tokenize_function, batched=True, batch_size=1, fn_kwargs={"model": model, "mem": MEM_TOKENS, "lm_ratio": training_args.lm_ratio})
After obtaining the data with this call, the number of examples displayed during training is not the same as the original dataset size. Is this the case for you? My progress bar shows 25,000.
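One possible explanation (an assumption about the cause, not something the ICAE code confirms): with `batched=True`, a map function is allowed to return a different number of rows than it receives, so `len(pretrain_dataset)` can differ from `len(train_dataset)`. A self-contained toy sketch of that behaviour; `toy_tokenize_function` and the data are made up for illustration:

```python
from datasets import Dataset

# Toy dataset standing in for train_dataset (names are illustrative)
train_dataset = Dataset.from_dict({"text": ["a b c d", "e f", "g h i"]})

def toy_tokenize_function(batch):
    # A batched map function may emit more or fewer rows than it receives,
    # e.g. when long texts are split into several chunks or short ones are dropped.
    out = {"tokens": []}
    for text in batch["text"]:
        words = text.split()
        # Split into chunks of 2 "tokens": one input row can become several output rows
        out["tokens"].extend([words[i:i + 2] for i in range(0, len(words), 2)])
    return out

pretrain_dataset = train_dataset.map(
    toy_tokenize_function,
    batched=True,
    batch_size=1,
    remove_columns=train_dataset.column_names,
)

print(len(train_dataset), len(pretrain_dataset))  # 3 vs. 5: the row count changed
```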
Hi folks, do you have anything to share? I am also trying to reproduce pretraining, but it seems to be very slow (too slow). @getao you mentioned you used "tens of billions of tokens"; can you also share approximate training times (e.g., number of days)?
@Kirili4ik There is a trick: setting position identifiers for the soft tokens can make the model converge more quickly.
https://arxiv.org/pdf/2409.14364v2
Example:
- context positions: [1,2,3,4,5,6,7,8]
- ICAE soft-token positions: [9, 10, 11, 12] (example with 4 soft tokens)
- better soft-token positions: [1, 3, 6, 9]
For RoPE, attention weights decay exponentially with respect to the relative token positions, specified by the position identifiers.
I was trying to play around with this here: https://github.com/condenses/condense-trainer/blob/main/compress/modeling/causal_lm.py
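Here is a minimal sketch of that position-id trick, assuming a RoPE-based decoder that accepts a `position_ids` argument (as Llama-style models in `transformers` do); the even-spacing rule below is just one way to reproduce the [1, 3, 6, 9] example above and may differ from the paper's exact scheme:

```python
import torch

def soft_token_position_ids(context_len: int, num_soft_tokens: int) -> torch.Tensor:
    """Position ids for [context tokens + soft/memory tokens].

    Instead of appending the soft tokens after the context
    (positions context_len+1, ..., context_len+num_soft_tokens),
    spread their positions across the context range so that, under RoPE,
    each soft token stays close in relative position to a different part
    of the context it is meant to summarize.
    """
    context_pos = torch.arange(1, context_len + 1)
    # Evenly spaced over [1, context_len + 1]; e.g. 8 context / 4 soft tokens -> [1, 3, 6, 9]
    soft_pos = torch.linspace(1, context_len + 1, num_soft_tokens).floor().long()
    return torch.cat([context_pos, soft_pos]).unsqueeze(0)  # shape (1, seq_len)

# Example matching the numbers above: 8 context tokens, 4 soft tokens
pos_ids = soft_token_position_ids(context_len=8, num_soft_tokens=4)
print(pos_ids)
# tensor([[1, 2, 3, 4, 5, 6, 7, 8, 1, 3, 6, 9]])

# These ids would be passed to a RoPE-based model alongside the embeddings, e.g.:
# outputs = model(inputs_embeds=embeds, position_ids=pos_ids)
```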
Wow @toilaluan thanks for the paper!!! The code at your link does not open, though. Maybe there is a misspelling, or is it private?
Thank you for your attention to our work. You can check out this link. https://arxiv.org/abs/2409.14364