United Branch: Issue With TCC Driver Mode
Excited to see that support for the new Fairseq models is in the works, so I decided to do some testing since I have a beefy workstation. Grabbed the 1.17 United branch and got it and its dependencies fully up to date, along with updating to Transformers 4.17 + Tokenizers 0.11.4. After giving GPU+CPU Breakmodel mode a shot on Fairseq 13B and confirming everything would load and generate text that wasn't malformed, I turned my attention to GPU + GPU Breakmodel testing. I currently have an RTX 3090 as the primary GPU and dusted off a spare Maxwell Titan X as a secondary so I could rack up enough VRAM to avoid dumping anything on the CPU.
Having both GPUs in WDDM driver mode works without error and is surprisingly performant -- I was seeing generation times in the 4-10 second range while distributing 26-28 layers on the 3090 and 12-14 on the Titan X. No errors thrown, although I would definitely say the output is still very... alpha-ey (not unexpected, I know this one's still a WIP, just mentioning for completeness). No garbage characters, but it seems to not quite be getting all the tokens out in the right order (often gives sentences like "to store The man to go decided to" and the like -- you can tell what it's trying to do, but it's out of order), and it occasionally mashes words together. (Shouldn't be an OOM issue here as I had free space on both GPUs, or a Breakmodel problem -- behavior was the same on F6.7B Dense even when loaded only on the 3090 with plenty of room to spare.)
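(For anyone following along who hasn't looked at what the GPU + GPU split is doing under the hood, here's a rough toy sketch of the idea -- this is my own simplified illustration, not KoboldAI's actual breakmodel code: a stack of layers gets assigned across two devices and the hidden state is moved to whichever device owns the next layer.)

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a single transformer layer."""
    def __init__(self, width):
        super().__init__()
        self.ff = nn.Linear(width, width)
    def forward(self, x):
        return torch.relu(self.ff(x))

n_layers, width = 12, 64
# Hypothetical split mirroring the roughly 2/3 vs 1/3 layer distribution above;
# falls back to CPU-only if two CUDA devices aren't present.
if torch.cuda.device_count() >= 2:
    devices = ["cuda:0"] * 8 + ["cuda:1"] * 4
else:
    devices = ["cpu"] * n_layers

blocks = nn.ModuleList(ToyBlock(width) for _ in range(n_layers))
for block, dev in zip(blocks, devices):
    block.to(dev)

def forward_split(x):
    # Move the hidden state to whichever device holds the next layer.
    for block, dev in zip(blocks, devices):
        x = block(x.to(dev))
    return x

print(forward_split(torch.randn(1, width)).shape)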
I then switched the Titan X into headless TCC mode to claw back the 20% VRAM Windows 10 reserves, rebooted for it to take effect, and kicked off KoboldAI again. This is where things get odd.
KoboldAI sees it and will happily load layers onto it, but 3-6 seconds after sending any input, instead of generating it throws a handful of errors (not OOM), pasted below and also in the included log "TCC Errors Fairseq". Adjusting layer distribution, settings, context window, or amount to generate has no effect. I double-checked this behavior with GPT-J6 Skein to make sure it wasn't an issue unique to Fairseq 13B, and it also goes haywire, though it at least attempts some output before producing utterly scrambled text. I've attached logs of both behaviors.
Fairseq 13B Breakmodel + TCC Errors:
Traceback (most recent call last):
File "aiserver.py", line 3001, in generate
genout, already_generated = tpool.execute(_generate, txt, minimum, maximum, found_entries)
File "D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\eventlet\tpool.py", line 132, in execute
six.reraise(c, e, tb)
File "D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\six.py", line 719, in reraise
raise value
File "D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\eventlet\tpool.py", line 86, in tworker
rv = meth(*args, **kwargs)
File "aiserver.py", line 2924, in _generate
genout = generator(
File "D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\transformers\generation_utils.py", line 1222, in generate
return self.sample(
File "aiserver.py", line 868, in new_sample
return new_sample.old_sample(self, *args, **kwargs)
File "D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\transformers\generation_utils.py", line 1813, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
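(In case it helps narrow down where the bad values first show up, this is the kind of quick sanity check I can run on a probability tensor right before that multinomial call -- a hypothetical helper of my own, not anything in aiserver.py or Transformers; the commented-out last line would reproduce the same RuntimeError on purpose.)

import torch

def explain_bad_probs(probs):
    """Report which condition would trip torch.multinomial."""
    print("any nan: ", torch.isnan(probs).any().item())
    print("any +inf:", torch.isposinf(probs).any().item())
    print("any -inf:", torch.isneginf(probs).any().item())
    print("any < 0: ", (probs < 0).any().item())

# A deliberately broken distribution as an example:
bad = torch.tensor([[0.5, float("nan"), 0.5]])
explain_bad_probs(bad)
# torch.multinomial(bad, num_samples=1)  # raises: probability tensor contains either `inf`, `nan` or element < 0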
Switching back to WDDM mode restores normal behavior for both Fairseq and Skein.
I'm not sure if this is a bug, or just something that's not supported yet. If it's the latter, I'd love to see TCC support added -- being able to reuse otherwise retired GPUs as headless secondaries in Breakmodel without wasting any VRAM would be a great help as model sizes continue to grow. (I do know Linux doesn't take a bite out of VRAM, but it'd be nice to keep my Windows drive for enjoying these large language models and the Linux drive for exploring finetuning and module creation for them.)
Let me know if there's any other info you need or anything you want me to test.
Specs
Windows 10 Pro 20H2
10900K on Z590, 128GB RAM
RTX 3090 primary, Maxwell Titan X secondary
NVidia Game Ready Driver 511.79, CUDA 11.6
Python 3.8
KoboldAI United current as of 2/14 (not using the temporary drive, installed and updated via the offline install requirements.bat)
Transformers 4.17 via GitHub + Tokenizers 0.11.4
Models pulled down directly from the KoboldAI section on Huggingface with all configs/vocab/etc., copied to their own folders within the KoboldAI models folder, and run via "load a model from its directory"
I got a single M40 this week and did not run into this issue. Can you try exclusively putting layers on your Titan as a test? Also, let us know which CUDA version you installed when you installed your driver, since in TCC mode that matters. I picked 11.6, and KoboldAI's runtime uses 11.1 or higher, which in WDDM mode is loaded from our own runtime.
And yes, your fairseq results are as I'd expect for now.
Hi henk717, thanks for the confirmation on what to expect from Fairseq at the moment! I'm running the latest NVIDIA CUDA version, 11.6.110. I nuked and reinstalled all CUDA components on my machine this afternoon to be 100% sure there weren't any 11.1 files left kicking around from earlier, and ran some more tests. Each combo below was run 2-3 times to check for consistency.
TCC Maxwell Titan (no Breakmodel) + FairSeq 1.3B == No error; produces funky, but expected output.
WDDM Maxwell Titan (no Breakmodel) + FairSeq 1.3B == Also successfully generates as above; quality is identical to TCC mode.

TCC Maxwell Titan + Breakmodel (CPU) + FairSeq 6.7B == Generates, but the output is totally malformed (junk characters).
WDDM Maxwell Titan + Breakmodel (CPU) + FairSeq 6.7B == Generates, but the output is badly malformed.
WDDM Ampere 3090 + Breakmodel (Maxwell Titan) + FairSeq 6.7B == Halts with an INF error. No output.

TCC Maxwell Titan + Breakmodel (CPU) + Skein == Generates, but the output is badly malformed.
WDDM Maxwell Titan + Breakmodel (CPU) + Skein == Generates, but the output is badly malformed.
WDDM Ampere 3090 + Breakmodel (CPU) + Skein == Generates clean output!
WDDM Ampere 3090 + Breakmodel (CPU) + FairSeq 13B == Generates as expected.
It really seems to be something about Maxwell with Breakmodel that goes berserk, which gets worse if run in TCC teamed with another GPU (the TCC Maxwell + WDDM Ampere combo is the only one that I can get to throw the INF error).
It might be a shot in the dark, but were both FairSeq and Skein originally BF16, rather than regular Float16? I recall seeing notes in DeepSpeed's inference-mode documentation warning against trying to span BF16 models across any multi-GPU configs that weren't pure Ampere, but it sounded like Ampere + CPU offload would work OK -- pretty reminiscent of what I'm seeing with Kobold and Breakmodel.
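(A quick way to sanity-check that guess on my end, assuming the downloaded config.json actually declares a dtype -- the model path below is just a placeholder for the local folder, and this is only my own ad-hoc check, not something Kobold does:)

import torch
from transformers import AutoConfig

# Native bf16 support requires compute capability 8.0+ (Ampere);
# Maxwell is sm_5x and Pascal is sm_6x.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(torch.cuda.get_device_name(i), f"sm_{major}{minor}", "native bf16:", major >= 8)

cfg = AutoConfig.from_pretrained("models/fairseq-dense-13B")  # placeholder path
print("declared torch_dtype:", getattr(cfg, "torch_dtype", None))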
I unfortunately don't currently have access to a second 3090 to test if Ampere + Ampere works (hate these shortages so much), but if nothing goes wrong, I should be able to get my hands on a Pascal Titan sometime in the next week or so. Once I have that in hand, I can at least check if Pascal bugs out in the same circumstances Maxwell does.
I attached logs with examples of some of the malformed output from the tests above (both FairSeq and Skein, in case seeing exactly how the two different models are failing in the problem configs is useful), as well as the one that threw another INF error:
TCC Maxwell plus CPU_Fair6.txt
WDDM Maxwell plus CPU_Fair6.txt
HALTS_WDDM Ampere plus TCC Maxwell_Fair6.txt
TCC Maxwell plus CPU_Skein.txt
WDDM Maxwell plus CPU_Skein.txt
For now I'd like to focus on the GPT-J and Neo models, since those we know for certain work correctly (at least until we can reliably get good output from XGLM / fairseq-dense in any of the other modes). So I ran my own tests exclusively on those models. Since my situation includes one AMD GPU and a single Nvidia GPU, the test case with Maxwell + CPU is the most interesting to compare.
My GPU is a Tesla M40, which is listed as Maxwell 2.0. In both WDDM mode and TCC mode I get identically good performance on Skein when using Breakmodel with a 50/50 split to my CPU.
One notable difference is that with the Tesla cards you download drivers specific to a CUDA version; with the Titan I do not see this, so I do not know which CUDA version it uses internally for TCC.
The driver i used for my test was this one : https://www.nvidia.co.uk/Download/driverResults.aspx/186577/en-uk
No problem, focusing on J6 and Neo and the Maxwell+CPU combos works for me.
As for Titans and drivers, there's a similar CUDA toolkit + driver package available for them as well, though it's not the Datacenter driver branch, as I don't think NVidia has ever opened the Datacenter drivers to any Titans except for Volta and perhaps Turing.
I wiped out the 511.79 drivers and CUDA that were installed separately before (DDU in safemode, no networking), and then used only the 11.6 CUDA toolkit installer bundle to install everything. The version it bundles for non-Teslas is 511.23.
I snagged it from NVidia's catalogue here: https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local
Curiously, the CUDA bundle driver installs a slightly newer version of the CUDA runtime than either a standalone RTX driver installer, or what GeForce Experience gives you. The 11.6 CUDA bundle gave me an NVCUDA64.dll 11.6.58 driver, according to the Components info tab in the NVidia Control Panel.
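(For reference, here's how I've been cross-checking what PyTorch itself reports against the Control Panel numbers -- just standard torch calls, nothing KoboldAI-specific:)

import torch

# CUDA runtime the PyTorch wheel was built against (Kobold's own 11.1+ runtime),
# as opposed to the driver's CUDA version shown by nvidia-smi / the Control Panel.
print("torch:", torch.__version__)
print("built with CUDA runtime:", torch.version.cuda)

for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(i, p.name, f"{p.total_memory / 2**30:.1f} GiB", f"sm_{p.major}{p.minor}")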
I took another crack at running Maxwell + CPU in both WDDM and TCC modes with this driver.
WDDM or TCC Maxwell + CPU + Skein == Badly malformed output in both modes.
I then tried Neo 2.7 Picard via option 7 in the model selection menu, rather than loading from an offline local folder, to see if that would make a difference for whatever reason. Interestingly, it pops a warning during the download and install, but proceeds.
D:\SkyNET\KoboldAI\Kobold_United\miniconda3\lib\site-packages\transformers\configuration_utils.py:356: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
warnings.warn(
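(Side note on that warning: the replacement it points at is just the method call below. The model path is a placeholder; I haven't changed anything in aiserver.py, just noting it for reference.)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("models/gpt-neo-2.7B-picard")  # placeholder local folder
# What the warning recommends instead of passing gradient_checkpointing via the config:
model.gradient_checkpointing_enable()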
WDDM Maxwell + CPU + Picard == Malformed output.
TCC Maxwell + CPU + Picard == Halts with an INF error. This is the first test case I've found that triggers an INF error without spanning across non-identical GPUs.
TCC Maxwell Only + Picard == Clean output.
I rechecked Ampere + TCC Maxwell on Skein and FairSeq6 out of curiosity to see if the 511.23 driver would change anything. Nope, Skein still gives malformed output, and FairSeq6 still throws an INF error. Oh well.
I then wiped out the 511.23 drivers and installed the GeForce 511.65 driver package (511.65-desktop-win10-win11-64bit-international-dch-whql), which included CUDA driver 11.6.99 and repeated all of the above tests exactly. Results were identical, unfortunately.
One last note, it turns out my friend was able to get that spare Pascal Titan mailed out earlier than expected, so I may be able to test Pascal + CPU as early as this weekend.
WDDM Maxwell + CPU_Picard.txt
HALTS_TCC Maxwell + CPU_Picard.txt
WDDM Maxwell + CPU_Skein.txt
TCC Maxwell + CPU_Skein.txt
I'm back, and I have a clue for you.
I updated KAI-United with your most recent commits, and then ran one final batch of tests on Maxwell, by rolling all the way back to driver 456.71/CUDA 11.1.96. Results were exactly the same as the tests before on the 511.XX drivers with CUDA 11.6.X. (Gibberish in WDDM mode, INF error in TCC, ok if not offloading to anything via Breakmodel.)
I then swapped out the Maxwell Titan and replaced it with the Pascal Titan. The Ampere 3090 stayed installed.
I first tested the Game Ready 511.65 driver + CUDA 11.6.99.
WDDM Pascal + CPU + Picard == Garbage output.
TCC Pascal + CPU + Picard == Halts with an INF error.
WDDM Ampere + TCC Pascal + Picard == Outputs trash, but does NOT throw an INF error.
Since the results with Game Ready 511.65 were so far identical to what I'd seen with Maxwell, I opted to save some time and skipped retesting with the Game Ready 456.71 drivers on Pascal. Instead, I decided to test the 456.71 Studio drivers -- I noticed that 456.71 Studio did NOT list Maxwell as supported, unlike the 456.71 Game Ready drivers. Once installed, I checked CUDA and noted that it was version 11.1.96, just like the GRD version of the driver.
WDDM Pascal + CPU + Picard == Garbage output, including spurious Chinese or Japanese characters. I'm attaching the log for this one as I saw mention in patch notes before about fixing an output "turning Japanese" problem.
TCC Pascal + CPU + Picard == Halts with INF error identical to Maxwell as logged before.
TCC Pascal Only + Picard == Good output.
I then decided to retry spanning Ampere + Pascal, without the CPU...
WDDM Ampere + WDDM Pascal + Picard == Good output!
WDDM Ampere + TCC Pascal + Picard == Beautiful output! (Same good results whether Ampere or Pascal is designated the lead GPU.)
WDDM Ampere + TCC Pascal + Skein == Good output!
I then did Ampere + CPU just for the sake of completeness.
WDDM Ampere + CPU + Picard == Good output.
Testing Summary -- Ampere + CPU doesn't seem to care about Game Ready vs Studio, or CUDA 11.6 vs 11.1. Pascal and Maxwell in any Breakmodel config do care, and something's different about the Studio drivers that lets Pascal Breakmodel properly when paired with Ampere (though not with the CPU), even though the CUDA runtime is the same version as in the GRD package. Maxwell is not listed as officially supported on the Studio driver branch, and trying to force installation acted strangely enough that I opted not to pursue that further.
Since your M40 will Breakmodel OK with the Datacenter drivers, I'm guessing whatever the key is, it's something shared between the Datacenter and Studio driver branches but not included in the Game Ready branch, rather than an actual hardware limitation blocking Maxwell. Wish I could dissect those drivers and bring you more info, but that's sadly past the limit of my skills.
Hope something in here helps, let me know if you have anything else you'd like me to check.
WDDM Pascal CPU Picard CUDA 11p1 Studio.txt
Note: Might be unrelated, but this behaviour also seems to happen with the Kepler architecture (Nvidia Jetson) running Fairseq 13B. For some unknown reason, it spits out -inf...