Yu-won Lee
I suggested it because the bash file you posted above, the one you're currently using, has offloading enabled. I'll set up the environment again from scratch and try.
Well, I've changed the server and it shows the same error now. I'm a bit confused about what the problem is. I was just using the docker with older pytorch version such...
My code isn't for unsloth, but I've added liger-kernel training. I haven't tested the minimum VRAM, but I've run it on 5 x V100 GPUs with CPU offloading. I'll try to...
@linkfzaro Liger-kernel is similar to unsloth in that it is optimized with Triton kernels. To the best of my knowledge, unsloth is a framework that optimizes the model with Triton kernels. That...
@Oguzhanercan You could comment out the line that uses liger_kernel, because it doesn't support QLoRA. https://github.com/2U1/Llama3.2-Vision-Finetune/blob/9c6821e95a6e962600ecc654b7e545d4a3dd316a/src/training/train.py#L71C5-L71C33 This is the line. I'll make an option for turning it on and off....
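For reference, a rough sketch of what such an on/off switch could look like. This is not the code in the repo: `KernelOptions`, `use_liger`, `use_qlora`, and `maybe_apply_liger` are hypothetical names, and `patch_fn` stands in for whatever call applies the Liger patch at the linked line.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KernelOptions:
    # Hypothetical option names, not taken from the repository.
    use_liger: bool = True
    use_qlora: bool = False


def maybe_apply_liger(opts: KernelOptions, patch_fn: Callable[[], None]) -> bool:
    """Call patch_fn only when Liger is requested and QLoRA is off.

    patch_fn is a placeholder for the real Liger patching call.
    Returns True if the patch was applied, False if it was skipped.
    """
    if opts.use_liger and not opts.use_qlora:
        patch_fn()
        return True
    return False


if __name__ == "__main__":
    calls = []
    opts = KernelOptions(use_liger=True, use_qlora=True)
    applied = maybe_apply_liger(opts, lambda: calls.append("patched"))
    print("liger applied:", applied)
```

With something like this, a QLoRA run would skip the patch automatically instead of needing the line commented out by hand.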
@Oguzhanercan Offloading takes much longer than non-offloading. I'll find some other way to improve the speed.
That's odd, since I enabled grad for the input embeddings. It is caused by gradient checkpointing. I'll check on this.
@gumona12 Okay, I found the reason. Some of the gradients are set to False when you freeze the vision tower while using gradient checkpointing. It doesn't matter when training, so it just...
I think your CUDA version and your environment are a bit out of sync. Can you reinstall torch for your current CUDA version and retry?
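A quick way to compare the two versions, as a minimal sketch. It assumes `nvcc` is on your PATH and that torch is installed; the helper names are made up for illustration, and the same-major check is only a rough heuristic, not an official compatibility rule.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def same_major(version_a: str, version_b: str) -> bool:
    """Rough heuristic: compare only the major CUDA version."""
    return version_a.split(".")[0] == version_b.split(".")[0]


def report() -> None:
    """Print the CUDA version torch was built with and the toolkit version."""
    import torch  # imported here so the helpers above work without torch

    torch_cuda = torch.version.cuda  # None on CPU-only builds
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
        system_cuda = parse_nvcc_release(result.stdout)
    except (OSError, subprocess.CalledProcessError):
        system_cuda = None  # nvcc missing or failed

    print("torch built with CUDA:", torch_cuda)
    print("nvcc toolkit CUDA:", system_cuda)
    if torch_cuda and system_cuda:
        print("same major version:", same_major(torch_cuda, system_cuda))


if __name__ == "__main__":
    report()
```

If the two major versions differ, reinstalling torch with the wheel built for your CUDA version is the first thing to try.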
The error message says you have CUDA 12.0. That's odd. I think it's some kind of version issue. I'll try to figure it out.