lksysML

Results: 12 comments by lksysML

> > @QinlongHuang Make sure your batch setting is correct. You can check more details here #188.
>
> Thx for your reply! But it does not work for me...

Running into the same issue. Getting OOM after 7-10% while running on 4x A100-40GB. Started at --micro_batch_size=24 and have been reducing it down to 8, and it still OOMs at around...
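
For context, in an alpaca-lora style finetune.py the effective batch is kept fixed and split into micro-batches via gradient accumulation, so lowering --micro_batch_size only shrinks the per-step activation memory, not the effective batch. A minimal sketch of that relationship, assuming the usual --batch_size / --micro_batch_size flags (the repo's actual wiring may differ slightly):

```python
# Sketch: keep a fixed effective batch size while shrinking per-GPU memory.
# Assumes alpaca-lora style flags --batch_size and --micro_batch_size;
# finetune.py in the repo may wire this up slightly differently.
import transformers

batch_size = 128        # effective examples per optimizer step
micro_batch_size = 8    # examples actually resident on the GPU at once
gradient_accumulation_steps = batch_size // micro_batch_size  # 16 here

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    output_dir="./lora-out",  # hypothetical output path for this sketch
)
```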

Tried setting max_split_size_mb to 128 MB and 64 MB. Still didn't help; it errors out at around 10%, when I think it is saving a checkpoint or something.
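
For reference, max_split_size_mb is passed to PyTorch through the PYTORCH_CUDA_ALLOC_CONF environment variable and only takes effect if it is set before the CUDA allocator is first used. A minimal sketch of how I set it (128 is just the value I tried; as noted above, it did not fix the crash):

```python
# Configure the CUDA caching allocator before torch touches the GPU.
# max_split_size_mb caps how large a cached block the allocator will split,
# which can reduce fragmentation-related OOMs (it did not help in this case).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the env var so the allocator picks it up

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```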

Yes

> I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:
>
> 1. Use a smaller `cutoff_len = 256`
> ...

Usually errors out when it reaches 200 iterations. @tloen What do you think? I rented 8x RTX 3090 and am getting the same issue there. At 10% or 200 iterations it errors...

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to a commit dated around 5-6 April, when my previous finetunes were successful. Didn't change...

Same error: https://github.com/tloen/alpaca-lora/issues/344 It errors out at 200 iterations. @tloen

> I checked and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me

Super!
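
In case it helps anyone reproducing the fix, here is a small sketch to confirm which versions are actually installed after downgrading. The bitsandbytes 0.37.2 pin is the one reported to work above; the other packages are only printed for reference, since the exact versions I rolled them back to aren't pinned here:

```python
# Quick check that the downgrade actually took effect in the current environment.
from importlib.metadata import version

for pkg in ("bitsandbytes", "transformers", "peft", "accelerate"):
    print(pkg, version(pkg))

# The fix reported above: bitsandbytes pinned to 0.37.2 instead of 0.38.0.
assert version("bitsandbytes") == "0.37.2", "bitsandbytes is not pinned to 0.37.2"
```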

Your pull request isn't working. It crashed when it tried to save a checkpoint; I was training on 8x RTX 3090.