Added CUDA and OpenCL support

Open niansa opened this issue 1 year ago • 51 comments

This PR aims to add support for CUDA and OpenCL. Once ready, I'll need someone to test CUDA support since I don't own an Nvidia card myself.

Testing instructions

Just a warning: the old models that get downloaded automatically will not work properly with OpenCL; the old llama.cpp format simply isn't supported. Currently they make the GUI freeze, and handling that gracefully is a GUI-side change that still needs to be done. Download a GGML model from here: https://huggingface.co/TheBloke and place it in your models folder. Make sure its filename starts with ggml-! The GUI might attempt to load another model on the way there and crash, since updating that won't be part of this PR. To prevent this, move the other models somewhere else.

To make the GUI actually use the GPU, you'll need to add either buildVariant = "cuda"; or buildVariant = "opencl"; after this line: https://github.com/tuxifan/gpt4all/blob/dlopen_gpu/gpt4all-backend/llmodel.cpp#L69
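
For reference, a minimal sketch of that edit (the surrounding code in llmodel.cpp may look different by the time you try this; add only one of the two assignments):

    // In gpt4all-backend/llmodel.cpp, directly after the linked line:
    // force a GPU build variant for testing (pick exactly one).
    buildVariant = "cuda";      // test the CUDA backend
    // buildVariant = "opencl"; // ...or test the OpenCL backend instead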

We also need some people testing on Windows with AMD graphics cards! And some people on Linux testing on Nvidia.

niansa avatar May 28 '23 13:05 niansa

This is a very important improvement but will have to be carefully tested.

We need to test that

  1. GPU support works on Windows and Linux machines with Nvidia graphics cards.
  2. Either the chat client or one set of bindings can effectively utilize the support.

AndriyMulyar avatar May 28 '23 19:05 AndriyMulyar

I'll be happy to test it on Windows 10, maybe even Linux. NVidia 3060. Just ping me when you think it's in a good-enough state.

cosmic-snow avatar May 28 '23 20:05 cosmic-snow

I'm happy to test as well. I have a Windows machine with a 3090.

ani1797 avatar May 29 '23 02:05 ani1797

I'd be happy to test on my Windows 10 machine. Cuda is installed already.

edit: GPU is 1660 Ti

chadnice avatar May 29 '23 17:05 chadnice

Wonderful! Thanks everyone :slightly_smiling_face:

niansa avatar May 29 '23 19:05 niansa

Just a warning: the old models that get downloaded automatically will not work properly with OpenCL; the old llama.cpp format simply isn't supported. Currently they make the GUI freeze, and handling that gracefully is a GUI-side change that still needs to be done.

niansa avatar May 29 '23 19:05 niansa

Here to help with testing on Windows 11, RTX 3090.

maiko avatar May 29 '23 20:05 maiko

Here to help with testing on Windows 11, RTX 3060 Ti. Thanks everyone!

duouoduo avatar May 29 '23 21:05 duouoduo

I've added testing instructions to the top post. :-)

niansa avatar May 29 '23 22:05 niansa

Hello! Thanks for the hard work.

I'm on Linux with an Iris Xe integrated GPU (OpenCL compatible). Is there any chance of it working? I've forced buildVariant = "opencl"; in the code as specified above. The backend and chat built without any errors.

But when I launch "chat", it just hangs forever without doing anything (consuming neither CPU nor RAM), with only the message "deserializing chats took: 0 ms".

I use the 13B snoozy model; it works perfectly on the main Nomic branch.

pierreduf avatar May 30 '23 10:05 pierreduf

OK, I finally got it working!

First of all, I had the OpenCL libs and headers but not CLBlast (I had overlooked the CMake warning). I built it from source, as the version included in my repos (Ubuntu 20.04) did not work: https://github.com/CNugteren/CLBlast

I also downloaded a new model (https://huggingface.co/TheBloke/samantha-13B-GGML/tree/main), as the snoozy one did not work (as you said in the first message; sorry for reading too fast).

I now have a working OpenCL setup! I hope it can help others. But unfortunately it does not speed anything up :D (my integrated GPU is probably not well suited for this).

Any idea how I could speed that up?

qt.dbus.integration: Could not connect "org.freedesktop.IBus" to globalEngineChanged(QString)
deserializing chats took: 0 ms
llama.cpp: loading model from /opt/gpt4all/gpt4allgpu//ggml-samantha-13b.ggmlv3.q5_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0,09 MB
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Gen12LP HD Graphics NEO'
ggml_opencl: device FP16 support: true
llama_model_load_internal: mem required  = 10583,26 MB (+ 1608,00 MB per state)

My GPU capabilities (using the OpenCL API) are below:

GPU VRAM Size: 25440 MB
Number of Compute Units: 96
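
(In case anyone wants to reproduce these numbers, below is a small, self-contained sketch using the plain OpenCL C API. Error checking is omitted and it simply picks the first GPU it finds; adjust as needed for your system.)

    // query_device.cpp -- illustrative OpenCL device query, no error handling.
    // Build with something like: g++ query_device.cpp -lOpenCL
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform = nullptr;
        cl_device_id device = nullptr;
        clGetPlatformIDs(1, &platform, nullptr);                            // first platform
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);  // first GPU

        cl_ulong mem = 0;  // global memory size in bytes
        cl_uint  cus = 0;  // number of compute units
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(mem), &mem, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, nullptr);

        std::printf("GPU VRAM Size: %llu MB\n", (unsigned long long)(mem / (1024 * 1024)));
        std::printf("Number of Compute Units: %u\n", cus);
        return 0;
    }

Note that on an integrated GPU, CL_DEVICE_GLOBAL_MEM_SIZE reports memory shared with the system, which is presumably why the VRAM figure above is so large.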

pierreduf avatar May 30 '23 13:05 pierreduf

Hello! Thanks for the hard work.

I'm on Linux with an Iris Xe integrated GPU (OpenCL compatible). Is there any chance of it working? I've forced buildVariant = "opencl"; in the code as specified above. The backend and chat built without any errors.

But when I launch "chat", it just hangs forever without doing anything (consuming neither CPU nor RAM), with only the message "deserializing chats took: 0 ms".

I use the 13B snoozy model; it works perfectly on the main Nomic branch.

Please reread the first message, the old models aren't supported :-)

niansa avatar May 31 '23 04:05 niansa

Any idea about how I could speed that up ?

Nope. Integrated graphics are pretty much unsuitable for this. But this should be enough to show that it's working! Thank you for giving it a try :-)

niansa avatar May 31 '23 04:05 niansa

Thank you for your answer, that's what I thought :(. Just out of curiosity, what would be the limiting factor for such an iGPU: RAM, because it's shared with the system (the GPU itself has only 128 MB from what I understand), or just the number of compute cores? Or something else?

I did more tests and I noticed something strange: I'm using intel_gpu_top to watch GPU usage. The GPU is clearly used when I'm watching a 4K 60 fps video on YouTube (i.e. hardware acceleration), but it seems not to be used at all by GPT4All (GPU version). Am I missing something?

pierreduf avatar May 31 '23 07:05 pierreduf

Sorry, it's taking a bit longer. I hadn't actually compiled anything with MSVC in a while -- and it looks like that's the way to go -- so I'm now dealing with some fun build errors (who doesn't love those!). Although I've already knocked a few down by installing/upgrading certain things.

Question: are there known minimum requirements for the things involved in the whole toolchain?

I'm now up-to-date with some things and using VS Community 2022 and Windows 10 SDK v10.0.20348.0, so newer than that isn't possible anyway (for win10). Still relying on an older CUDA SDK (v11.6), however. Might just have to go update that, too, if nothing else helps.

I should probably go and have a closer look at the llama.cpp project.

cosmic-snow avatar May 31 '23 12:05 cosmic-snow

Yeah, CUDA setup should be documented in the llama.cpp repo

niansa avatar May 31 '23 14:05 niansa

I'm a bit reluctant to turn this into a troubleshooting session here -- in a pull request comment of all places -- but what I've seen so far might help others who want to try CUDA.

Well, it's quite weird with MSVC to say the least. So far I've run into the following problems. This was still before the forced pull/merge yesterday, which has helped quite a bit now:

  • Note: in all of the following, I've used ...\vcvarsall.bat x64 and I was trying to simply build the backend itself in a first step. I worked locally with a git fetch origin pull/746/head:trying-cuda; git checkout trying-cuda.

  • Some earlier problems got resolved by updating to Visual Studio 2022 and the latest Windows 10 SDK. I'm not going to go into detail about those. I'm still on CUDA v11.6, however. It doesn't seem to be a problem, after all.

  • many errors in gpt4all-backend\llama.cpp-mainline\ggml-cuda.cu with message: error : expected an expression

    • Was a very puzzling error initially, because the GGML_CUDA_DMMV_X/GGML_CUDA_DMMV_Y macros this pointed to were simple #defines in the code. Turns out that for some reason, these #defines are overridable through CMake compiler options and are actually set in the config -- only those settings were somehow not passed through in the end. Resolved by manually editing the relevant .vcxproj file and changing all relevant compiler invocations.
    • Resolved. This doesn't happen anymore since the force push.
  • Warning about a feature requiring C++ standard 20.

    • Fixed by editing CMakeLists.txt and replacing set(CMAKE_CXX_STANDARD 17) with set(CMAKE_CXX_STANDARD 20)
    • Resolved. This isn't necessary anymore since the force push/merge.
  • minor problem: warning C5102: ignoring invalid command-line macro definition '/arch:AVX2', but /arch:AVX2 is a perfectly valid flag in MSVC.

    • I've figured out why it happens: it's following a /D, but is not about setting a macro definition. It's a valid flag by itself. Have not figured out why it's generated that way, though.
    • Doesn't occur when compiling the main branch, it seems?
    • Still happens after the force push/merge.
  • Main problem: Build errors in many projects: error MSB3073: ... <many script lines omitted> ... :VCEnd" exited with code -1073741819.

    • code -1073741819 is hexadecimal 0xC0000005, which seems to be the code for an access violation. Yikes. Did my compiler just crash?
    • Found this and this as potentially talking about the same problem. The former is a downvoted and unanswered SO question, and the latter says to disable the /GL compiler flag (not tried before the force push).
    • Still seeing these errors after the force push/merge.
    • So far I did everything on the command line. This was somehow resolved by opening the .sln in Visual Studio and building the whole thing twice (after the first run showed the same errors). (???)

(Of course, I cannot exclude the possibility that all of this is yet another case of PEBKAC.)

=> So now I have managed to get a compiled backend, at last.

P.S. I could also try compiling everything with a MinGW setup (I prefer MSYS2 MinGW here). Is that something that's supposed to be supported in the future? I've invested quite some time to help troubleshoot problems there (mainly in 758, 717 and 710) and I guess it's not a good user experience -- but that also has to do with the Python bindings package.

cosmic-snow avatar Jun 01 '23 11:06 cosmic-snow

A compile issue on MSVC has been found and will be solved soon, @cosmic-snow! I'll notify you when there's more.

niansa avatar Jun 01 '23 16:06 niansa

Oh really? That's good to know. But not urgent, because here's where I am now:

  • I tried compiling the backend by itself so I might get away with just testing through the Python bindings.

  • Turns out, the C API has changed, too. So I decided to finally do the full setup and download Qt Creator.

  • Some time and a few gigabytes later, it wasn't very hard to configure; most of the things were set correctly out of the box (I did have to compile this one twice, too, but that's a minor inconvenience). The only thing I changed was CMAKE_GENERATOR to Visual Studio 17 2022.

  • I already had prepared mpt-7b-instruct.ggmlv3.q4_1.bin, which I renamed to ggml-mpt-7b-instruct.ggmlv3.q4_1.bin, downloaded from: https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML/tree/main. This did not get recognised correctly:

    • gptj_model_load: invalid model file ... (bad vocab size 2003 != 4096) and GPT-J ERROR: failed to load model although of course it's not a GPT-J model.
  • I then downloaded Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_1.bin and renamed it to ggml-Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_1.bin (https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/tree/main). And this works.

However, it did not seem to use my GPU, despite me setting buildVariant = "cuda";, so that's what I'm looking into at the moment.

Edit: It's clearly doing work in the CUDA-enabled library (the cut-off name is ggml_graph_compute).

Edit2: I added a simple debugging printf at line 9786 in ggml.c, and it looks like the check ggml_cuda_can_mul_mat(...) is simply never true in my case. Maybe I need a different model? But that's just a guess. To really understand what's going on, I'd need to spend more time understanding llama.cpp.
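
(Roughly the kind of check involved -- a hypothetical sketch only, not the actual code around line 9786 of ggml.c; the argument list of ggml_cuda_can_mul_mat may differ:)

    // Hypothetical sketch, not the real ggml.c code: log whether the CUDA
    // mul_mat path is ever taken before ggml falls back to the CPU path.
    if (ggml_cuda_can_mul_mat(src0, src1, dst)) {
        // ... offload the matrix multiplication to the GPU ...
    } else {
        fprintf(stderr, "mul_mat: CUDA path not taken for this tensor\n");
    }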

Edit3: Added set(CMAKE_AUTOMOC OFF) to the beginning of gpt4all-backend/CMakeLists.txt. This makes it easier for me to understand the compilation output and should not mess anything up, I think (but I'm no expert here). Aside: it'd probably be better not to set it ON globally in the chat CMakeLists.txt, but only for the targets that actually use Qt. Might improve build speed slightly, too.

Edit4: One thing that feels odd is that the macro definition GGML_USE_CUBLAS is only ever activated in the compiler options of ggml.c, but llama.cpp (the file, not the project) has an #ifdef section depending on it. Talking about mainline here, but I think other targets have that, too.
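
(To illustrate the kind of section meant here -- a hedged sketch only, the actual contents of that #ifdef block in llama.cpp may differ:)

    // If GGML_USE_CUBLAS is only defined while compiling ggml.c, a guard like
    // this in llama.cpp (the file) is silently compiled out, so whatever
    // CUDA-specific setup it contains never happens.
    #ifdef GGML_USE_CUBLAS
    #include "ggml-cuda.h"
    // ... CUDA-specific initialisation / buffer handling ...
    #endif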

cosmic-snow avatar Jun 01 '23 16:06 cosmic-snow

@cosmic-snow thanks for the testing efforts!! Please note that MPT/GPT-J isn't supported in the new GGML formats yet. I have added the missing compile defines to the CMake file for llama, please try again now. :-)

niansa avatar Jun 02 '23 07:06 niansa

I'm getting the error:

CMake Error at llama.cpp.cmake:280 (target_compile_definitions):
  Cannot specify compile definitions for target "llama-230511-cuda" which is
  not built by this project.
Call Stack (most recent call first):
  CMakeLists.txt:90 (include_ggml)

Note: I just copied your most recent changes over, not going through Git. Not sure if that changed any line numbers, but the error should be clear: CUDA isn't present yet in that version.

I think I've seen some conditionals like that in CMakeLists.txt. Maybe I can fix it myself.

Edit: I was mistaken, a previous build produced a llama-230511-cuda.dll. Sorry, it's probably better to just start from a clean slate again.

Edit2: Trying again with a clean version of the patchset helped already, but now I'm getting the GGML_CUDA_DMMV_X/GGML_CUDA_DMMV_Y error again, which I thought was resolved. Although I can see they're supposed to be defined in the CMake files -- in the compiler string for ...\llama.cpp-mainline\ggml-cuda.cu they show up empty: ... -DGGML_CUDA_DMMV_X= -DGGML_CUDA_DMMV_Y= .... I'm starting to think it's something on my end that I'm missing here.

Edit3: Maybe it's an ordering problem now in how the CMakeLists.txt get read? Copying the following from ...\llama.cpp-mainline\CMakeLists.txt to right before they're used in llama.cpp.cmake fixed that particular error:

    set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kernels")
    set(LLAMA_CUDA_DMMV_Y "1" CACHE STRING  "llama: y block size for dmmv CUDA kernels")
    if (GGML_CUBLAS_USE)
        target_compile_definitions(ggml${SUFFIX} PRIVATE
            GGML_USE_CUBLAS
            GGML_CUDA_DMMV_X=${LLAMA_CUDA_DMMV_X}
            GGML_CUDA_DMMV_Y=${LLAMA_CUDA_DMMV_Y})
    ...

Edit4: Something is still decidedly wrong here. I'm now getting a linker error (in short, it doesn't find the LLModel::construct() symbol) when trying to build the chat application, and that doesn't look like something that was even touched by your previous commit. I know where its implementation is, but somehow the llmodel.dll just winds up empty now -- at least that's what inspecting it with 'DLL Export Viewer' says. I successfully built it on the main branch yesterday and can see the symbol in that version's DLL.

I'll keep trying for a bit, but I guess I ultimately need to figure out what's wrong with the build process as a whole here.

cosmic-snow avatar Jun 02 '23 08:06 cosmic-snow

I apologize, there was a little mistake in llama.cpp.cmake :-) That should be solved now. Again, thanks a lot for testing all this!

niansa avatar Jun 02 '23 10:06 niansa

That should be solved now. Again, thanks a lot for testing all this!

You're welcome. And yes, although I'm not going to pull those fixes again right now, that looks like it solves that particular problem.

In the meantime I've managed to get it to work somehow, although I don't understand it yet. And I can confirm it was running on CUDA (still v11.6 instead of the latest v12.1), at least until it crashed.

Next, I guess I'll try to figure out:

  • Build problems, esp. error MSB3073 with code -1073741819 / 0xC0000005, which seems to be the main culprit
  • the /arch:AVX2 warning

Edit: I think I've found the problem with the /arch:AVX2. Here: https://github.com/nomic-ai/gpt4all/blob/e85908625f25190ad43f063979e0e95b889bc56b/gpt4all-backend/llama.cpp.cmake#L361-L363 it should be target_compile_options(... instead of target_compile_definitions(.... I was looking at ...\llama.cpp*\CMakeLists.txt this whole time, so it's no wonder I couldn't figure that one out.

Edit2: Regarding the build problems, I've figured at least something out: If after compiling everything twice the llmodel.dll ends up empty, manually opening its Visual Studio project, disabling /GL (as mentioned above and recommended here) and recompiling it by itself fixes the problem.

Edit3: Maybe also bump the version number? https://github.com/nomic-ai/gpt4all/blob/e85908625f25190ad43f063979e0e95b889bc56b/gpt4all-backend/CMakeLists.txt#L19-L21 The new C API is not compatible with the previous one, otherwise I could've just tested the backend with the Python bindings.

Edit4: So I guess the /GL setting was the problem in all the projects that failed with error MSB3073 ... and had to be built twice. As a workaround, I've added set(IPO_SUPPORTED OFF) right after the following: https://github.com/nomic-ai/gpt4all/blob/e85908625f25190ad43f063979e0e95b889bc56b/gpt4all-backend/CMakeLists.txt#L31-L38 Note: I'm not suggesting it should be turned off permanently for MSVC; maybe I or someone else will be able to figure out why it behaves like that and come up with a proper fix. I did try with only set(LLAMA_LTO OFF) at first, but that was not enough.

cosmic-snow avatar Jun 02 '23 11:06 cosmic-snow

Has conflicts and is outdated. Should it be closed?

manyoso avatar Jun 26 '23 17:06 manyoso

No! Back on track :-)

niansa avatar Jun 29 '23 09:06 niansa

@cosmic-snow lots of stuff has happened, most significantly some CMake fixes. I'd suggest trying again now, if you want :+1:

niansa avatar Jun 29 '23 11:06 niansa

Alright.

I thought I'd do the standard thing I do these days when just updating main, which is making a backend build just by itself at first (from within MSYS2; MinGW64):

cd gpt4all-backend; mkdir build && cd build
cmake ..  # then cmake --build .

This already failed, because the CUDA build doesn't work when I'm only running a MinGW build:

Details

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/include (found version "11.6.55")
CMake Error at C:/dev/env/msys64/mingw64/share/cmake/Modules/CMakeDetermineCompilerId.cmake:751 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

  Compiler: C:/Program Files/NVIDIA GPU Computing
  Toolkit/CUDA/v11.6/bin/nvcc.exe

  Build flags:
  Id flags: --keep;--keep-dir;tmp -v

  The output was:
  1
  nvcc fatal : Cannot find compiler 'cl.exe' in PATH

Call Stack (most recent call first):
  C:/dev/env/msys64/mingw64/share/cmake/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
  C:/dev/env/msys64/mingw64/share/cmake/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
  C:/dev/env/msys64/mingw64/share/cmake/Modules/CMakeDetermineCUDACompiler.cmake:307 (CMAKE_DETERMINE_COMPILER_ID)
  CMakeLists.txt:56 (enable_language)

-- Configuring incomplete, errors occurred!

Not a show-stopper, but something to keep in mind.

Then I did a regular build inside Qt Creator. It went without a problem and I could run it (but it only used the CPU).

After that, I edited the llmodel.cpp source to add buildVariant = "cuda"; as required and rebuilt it again. That went fine, as well, but my NVIDIA GPU still showed no load after that. I thought the problem was that I tried with current Hermes. Selecting a different model folder (where I stored the model for the previous test) somehow didn't work, although the one I use normally already isn't standard. Finally, I moved the model over to the default folder -- but still didn't get any load on my GPU.

So that's where I've left it. Can't really say what needs to be done now to switch it to GPU. 🤔

cosmic-snow avatar Jun 29 '23 20:06 cosmic-snow

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/include (found version "11.6.55")
CMake Error at C:/dev/env/msys64/mingw64/share/cmake/Modules/CMakeDetermineCompilerId.cmake:751 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

set CUDATOOLKITDIR = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1

Adding -D CMAKE_CUDA_COMPILER=$(which nvcc) to cmake fixed this for me: cmake . -D TCNN_CUDA_ARCHITECTURES=86 -D CMAKE_CUDA_COMPILER=$(which nvcc) -B build


Perhaps fixing the error this way will fix it for you too. It would also be helpful if you uploaded your files with the latest changes here, even if they don't work, so other people could have a look at why it doesn't work.

jensdraht1999 avatar Jun 30 '23 22:06 jensdraht1999

I have the alleged fix from here: https://github.com/NVlabs/instant-ngp/issues/923

jensdraht1999 avatar Jun 30 '23 22:06 jensdraht1999

set CUDATOOLKITDIR = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1

Adding -D CMAKE_CUDA_COMPILER=$(which nvcc) to cmake fixed this for me: cmake . -D TCNN_CUDA_ARCHITECTURES=86 -D CMAKE_CUDA_COMPILER=$(which nvcc) -B build

Perhaps fixing the error this way will fix it for you too. It would also be helpful if you uploaded your files with the latest changes here, even if they don't work, so other people could have a look at why it doesn't work.

@jensdraht1999 Thanks for trying to help.

But I wasn't really looking for a fix for that. It was more meant as a note on what happens in a MinGW backend build. I'd expect that to just keep working as it does now if CUDA is not properly configured for it. (It's what's generally used for the bindings on Windows.)

What really matters at the moment is the Qt Creator build, which I've configured for CUDA and which uses MSVC instead. That was the one through which I got CUDA working previously, and it already did so with v11.6 instead of v12.1 of the CUDA toolkit, so there shouldn't be a need to upgrade.

I feel like it might even be an advantage to have confirmation on whether it functions with an older version for compatibility reasons.

Btw, what you can do to help is just try to build & run yourself on one or more platforms, then document whether it works or if you ran into trouble of some sort. That's all I'm doing at the moment, too.

cosmic-snow avatar Jul 02 '23 14:07 cosmic-snow