lksysML

Results: 12 comments by lksysML

> > @QinlongHuang Make sure your batch setting is correct. You can check more details here #188.
>
> Thx for your reply! But it does not work for me...

Running into the same issue. Getting OOM after 7-10% while running on 4x A100-40GB. Started at --micro_batch_size=24 and have been reducing it down to 8, and it still OOMs at around...
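
For context, in an alpaca-lora style finetune.py the effective batch is kept fixed and split into micro-batches via gradient accumulation, so lowering --micro_batch_size only shrinks the per-step activation memory, not the effective batch. A minimal sketch of that relationship, assuming the usual --batch_size / --micro_batch_size flags (the repo's actual wiring may differ slightly):

```python
# Sketch: keep a fixed effective batch size while shrinking per-GPU memory.
# Assumes alpaca-lora style flags --batch_size and --micro_batch_size;
# finetune.py in the repo may wire this up slightly differently.
import transformers

batch_size = 128        # effective examples per optimizer step
micro_batch_size = 8    # examples actually resident on the GPU at once
gradient_accumulation_steps = batch_size // micro_batch_size  # 16 here

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    output_dir="./lora-out",  # hypothetical output path for this sketch
)
```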

Tried setting max_split_size_mb to 128 MB and 64 MB. Still didn't help; it errors out at around 10%, when I think it is saving a checkpoint or something.
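
For reference, max_split_size_mb is passed to PyTorch through the PYTORCH_CUDA_ALLOC_CONF environment variable and only takes effect if it is set before the CUDA allocator is first used. A minimal sketch of how I set it (128 is just the value I tried; as noted above, it did not fix the crash):

```python
# Configure the CUDA caching allocator before torch touches the GPU.
# max_split_size_mb caps how large a cached block the allocator will split,
# which can reduce fragmentation-related OOMs (it did not help in this case).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the env var so the allocator picks it up

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```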

Yes

> I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:
>
> 1. Use a smaller `cutoff_len = 256`
> ...

Usually errors out when it reaches 200 iterations. @tloen What do you think? I rented 8x RTX 3090 and am getting the same issue there. At 10% or 200 iterations it errors...

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to a commit dated around 5-6 April, when my previous finetunes were successful. Didn't change...

Same error: https://github.com/tloen/alpaca-lora/issues/344 It errors out at 200 iterations. @tloen

> I checked and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me

Super!
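
In case it helps anyone reproducing the fix, here is a small sketch to confirm which versions are actually installed after downgrading. The bitsandbytes 0.37.2 pin is the one reported to work above; the other packages are only printed for reference, since the exact versions I rolled them back to aren't pinned here:

```python
# Quick check that the downgrade actually took effect in the current environment.
from importlib.metadata import version

for pkg in ("bitsandbytes", "transformers", "peft", "accelerate"):
    print(pkg, version(pkg))

# The fix reported above: bitsandbytes pinned to 0.37.2 instead of 0.38.0.
assert version("bitsandbytes") == "0.37.2", "bitsandbytes is not pinned to 0.37.2"
```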

Your pull request isn't working. It crashed when it tried to save a checkpoint; I was training on 8x RTX 3090.