
Error while trying to run training in Windows

Open amdnsr opened this issue 1 year ago • 9 comments

Error invalid device ordinal at line 359 in file G:\F\Projects\AI\text-generation-webui\bitsandbytes\csrc\pythonInterface.c
C:\arrow\cpp\src\arrow\filesystem\s3fs.cc.2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

This error is thrown just when the training loop starts, and the terminal remains stuck and unresponsive.

amdnsr avatar May 28 '23 21:05 amdnsr

Happens to me on Windows too, but it looks the same as #3, so it's likely not Windows specific.

stoperro avatar May 29 '23 04:05 stoperro

Hmm, #3 seemed to be caused by a too-old transformers version (without the PRs). I double-checked and I do have the newest transformers with the PRs, yet the issue still happens.

stoperro avatar May 29 '23 05:05 stoperro

Ok, this might be Windows specific. The problem is in cudaMemPrefetchAsync(), and Stack Overflow suggests the GPU may not support this feature.

I wrote this code to check if my GPU supports it:

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    // cudaGetDeviceCount() also fails if no driver/runtime is available.
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::cout << "No CUDA capable devices found." << std::endl;
        return 0;
    }

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);

        // concurrentManagedAccess == 1 means the GPU supports on-demand page
        // migration, which cudaMemPrefetchAsync() relies on.
        if (deviceProp.concurrentManagedAccess) {
            std::cout << "GPU " << i << " supports concurrent managed access." << std::endl;
        } else {
            std::cout << "GPU " << i << " does not support concurrent managed access." << std::endl;
        }
    }

    return 0;
}
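
The snippet only uses the CUDA runtime API, so a plain nvcc build is enough, e.g. nvcc check_managed.cu -o check_managed (the file name here is arbitrary).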

And it turns out neither of my 3090s supports it on my Windows machine.

From https://developer.nvidia.com/blog/unified-memory-cuda-beginners/:

The device attribute concurrentManagedAccess tells whether the GPU supports hardware page migration and the concurrent access functionality it enables. A value of 1 indicates support. At this time it is only supported on Pascal and newer GPUs running on 64-bit Linux.

So maybe they never enabled it outside of 64-bit Linux?

Edit: yeah, there is likely still no Windows support for that, per https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#system-requirements :

GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription that are outlined throughout this document. Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.
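
If that is indeed the cause, the same error string should be reproducible outside of bitsandbytes with a few standalone lines (a minimal sketch, assuming the prefetch call is what fails):

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int* p = nullptr;
    cudaMallocManaged(&p, 1 << 20);  // 1 MiB of managed memory

    // On a device where concurrentManagedAccess == 0, the CUDA docs list
    // cudaErrorInvalidDevice as a possible result, whose error string is
    // "invalid device ordinal" -- the same message bitsandbytes reports.
    cudaError_t err = cudaMemPrefetchAsync(p, 1 << 20, /*dstDevice=*/0, /*stream=*/0);
    std::cout << "cudaMemPrefetchAsync: " << cudaGetErrorString(err) << std::endl;

    cudaFree(p);
    return 0;
}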

stoperro avatar May 29 '23 05:05 stoperro

Good news is that this cudaMemPrefetchAsync() call may not be required for the code to work - https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8dc9199943d421bc8bc7f473df12e42:

Note that this API is not required for functionality and only serves to improve performance by allowing the application to migrate data to a suitable location before it is accessed. Memory accesses to this range are always coherent and are allowed even when the data is actively being migrated.
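
So a workaround is to simply skip the prefetch on devices that don't support it. This is not the actual bitsandbytes patch, just a sketch of the kind of guard that should suffice (the wrapper name is made up):

#include <cuda_runtime.h>

// Prefetch managed memory to a device only when that device supports
// concurrent managed access; otherwise skip the call, since (per the docs
// quoted above) the prefetch is only a performance hint, not required for
// correctness.
cudaError_t prefetchIfSupported(void* ptr, size_t bytes, int device, cudaStream_t stream) {
    int concurrentManagedAccess = 0;
    cudaError_t err = cudaDeviceGetAttribute(&concurrentManagedAccess,
                                             cudaDevAttrConcurrentManagedAccess,
                                             device);
    if (err != cudaSuccess) {
        return err;
    }
    if (!concurrentManagedAccess) {
        // Windows (TCC or WDDM) and pre-Pascal GPUs end up here; the basic
        // Unified Memory model still keeps the data coherent, just without
        // the prefetch optimization.
        return cudaSuccess;
    }
    return cudaMemPrefetchAsync(ptr, bytes, device, stream);
}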

stoperro avatar May 29 '23 05:05 stoperro

I've created an issue for this: TimDettmers/bitsandbytes#453 .

The bad news is that this Paged Optimizer (meant to avoid OoM due to memory spikes) likely won't work as advertised on Windows :(

stoperro avatar May 29 '23 06:05 stoperro

I'm trying this on a Tesla V100S, which has compute capability 7.0 and satisfies the requirements for training. Also, I am able to do the training on the same GPU on an Ubuntu 20.04 system.

amdnsr avatar May 29 '23 08:05 amdnsr

@stoperro, can you please share with us a copy of your latest compiled bitsandbytes-0.39.0-for-windows.dll with the aforementioned hotfix? This error has been driving me crazy and I'm unable to compile the code myself. Much appreciated!

johnny0213 avatar Jun 15 '23 03:06 johnny0213

@johnny0213 this is my latest build, but I made it around a month ago - https://github.com/stoperro/bitsandbytes_windows/releases/tag/pre-v0.39.0-win0 - so it's not based on the very latest version of bitsandbytes. It was enough for me to run qlora, though.

As downloading binaries from unknown people is dangerous, I would still recommend compiling the binaries from scratch (after reviewing the changes) - maybe this will help: https://github.com/TimDettmers/bitsandbytes/issues/30.

stoperro avatar Jun 15 '23 08:06 stoperro

@stoperro My gratitude. qlora is now running just fine and dandy. Also, thanks for the reminder - I will try to compile it myself another day.

johnny0213 avatar Jun 16 '23 01:06 johnny0213