qlora
Error while trying to run training in Windows
Error invalid device ordinal at line 359 in file G:\F\Projects\AI\text-generation-webui\bitsandbytes\csrc\pythonInterface.c
C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
This error is thrown just when the training loop starts, and the terminal remains stuck and unresponsive.
Happens to me on Windows too, but looks same as #3, so likely not Windows specific.
Hmm, #3 seemed to be caused by too old a transformers version (one without the PRs). I double-checked, and I do have the newest transformers with the PRs, yet the issue still happens.
Ok, this might be Windows specific. The problem is on cudaMemPrefetchAsync(), and Stack Overflow suggests the GPU may not support this feature.
I wrote this code to check if my GPU supports it:
#include <iostream>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        std::cout << "No CUDA capable devices found." << std::endl;
        return 0;
    }
    for (int i = 0; i < deviceCount; ++i) {
        // concurrentManagedAccess is 1 when the GPU supports hardware page
        // migration - the feature cudaMemPrefetchAsync() relies on.
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);
        if (deviceProp.concurrentManagedAccess) {
            std::cout << "GPU " << i << " supports concurrent managed access." << std::endl;
        } else {
            std::cout << "GPU " << i << " does not support concurrent managed access." << std::endl;
        }
    }
    return 0;
}
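If anyone wants to run the same check, it should compile with plain nvcc (the file name here is just what I happened to use):

nvcc concurrent_check.cu -o concurrent_check
./concurrent_check    (concurrent_check.exe on Windows)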
And it turns out neither of my 3090s supports it on my Windows machine.
From https://developer.nvidia.com/blog/unified-memory-cuda-beginners/:
The device attribute concurrentManagedAccess tells whether the GPU supports hardware page migration and the concurrent access functionality it enables. A value of 1 indicates support. At this time it is only supported on Pascal and newer GPUs running on 64-bit Linux.
So maybe they never enabled it on anything other than 64-bit Linux?
Edit: yeah, likely there's still no Windows support for that - https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#system-requirements :
GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription that are outlined throughout this document. Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.
The good news is that this cudaMemPrefetchAsync() call may not be required for the code to work - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8dc9199943d421bc8bc7f473df12e42:
Note that this API is not required for functionality and only serves to improve performance by allowing the application to migrate data to a suitable location before it is accessed. Memory accesses to this range are always coherent and are allowed even when the data is actively being migrated.
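So a fix can be as simple as skipping the prefetch on devices that don't support it. A minimal sketch of such a guard (the helper name is mine, not the actual bitsandbytes code):

#include <cuda_runtime.h>

// Only issue cudaMemPrefetchAsync() when the device supports concurrent
// managed access. Per the docs above, the call is a pure performance hint,
// so skipping it keeps the code correct.
void maybePrefetch(const void* ptr, size_t bytes, int device, cudaStream_t stream) {
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    if (concurrent) {
        cudaMemPrefetchAsync(ptr, bytes, device, stream);
    }
    // Otherwise do nothing - managed memory accesses remain coherent and the
    // driver migrates pages as needed.
}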
I've created an issue for this: TimDettmers/bitsandbytes#453.
The bad news is that this Paged Optimizer (meant to avoid OOM due to memory spikes) likely won't work as advertised on Windows :(
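For context, my understanding is that the paged optimizer keeps optimizer state in managed memory so the driver can page it between host and device instead of hard-failing with OOM. A rough sketch of the concept (not the actual bitsandbytes code):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Optimizer state in managed memory: under memory pressure the driver
    // can page it, rather than the allocation failing outright.
    size_t stateBytes = size_t(1) << 30; // e.g. 1 GiB of optimizer state
    float* state = nullptr;
    cudaError_t err = cudaMallocManaged(&state, stateBytes);
    if (err != cudaSuccess) {
        std::printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // On 64-bit Linux with concurrentManagedAccess, pages of `state` migrate
    // on demand and the allocation may even oversubscribe GPU memory. On
    // Windows the basic (pre-6.x) model applies, so that benefit is lost.
    cudaFree(state);
    return 0;
}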
I'm trying this on a Tesla V100S, which has compute capability 7.0 and satisfies the requirements for training. I'm also able to do the training on the same GPU on an Ubuntu 20.04 system.
@stoperro, can you please share with us a copy of your latest compiled bitsandbytes-0.39.0-for-windows.dll with the hot fix mentioned above? This error has been agonizing me, and I'm unable to compile the code myself. Much appreciated!
@johnny0213 this is my latest compiled build - https://github.com/stoperro/bitsandbytes_windows/releases/tag/pre-v0.39.0-win0 - but I did it around a month ago, so it's not based on the literal latest version of bitsandbytes. It was working for me to run qlora, though.
As downloading binaries from unknown people is dangerous, I would still recommend compiling the binaries from scratch (after reviewing the changes) - maybe this will help: https://github.com/TimDettmers/bitsandbytes/issues/30.
@stoperro My gratitude. qlora is now running just fine and dandy. Also, thanks for the reminder - I will try to compile it myself another day.