
Failed to load NVIDIA driver within CVM (TDX)

Open herozyg opened this issue 2 years ago • 54 comments

NVIDIA Open GPU Kernel Modules Version

535.54.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [X] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04

Kernel Release

6.2

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [X] I am running on a stable kernel release.

Hardware: GPU

A10

Describe the bug

Installed the latest driver in a TDVM and failed to run "nvidia-smi"; log below:

(screenshot of the error output)

Could you please advise? Thank you!

To Reproduce

GPU: A10
CPU: Intel CPU w/ TDX
Install latest driver 535.54.03 in TDVM.
Run "nvidia-smi".

Bug Incidence

Always

nvidia-bug-report.log.gz

no.

More Info

No response

herozyg avatar Jul 16 '23 09:07 herozyg

Is it possible to generate and attach an nvidia-bug-report.log.gz? Maybe you would need to run nvidia-bug-report.sh with --safe-mode. Or, maybe attach your kernel log? It would be nice to be able to copy&paste error messages, rather than transcribe from a screenshot.

The "swiotlb buffer is full" error sounds like the problem.

Can you double check that the NVIDIA proprietary driver at the same version (535.54.03) works fine in this configuration? I'd be surprised if interaction with swiotlb differed between the open and closed kernel modules.

aritger avatar Jul 17 '23 06:07 aritger

FYI. Thanks @aritger for your reply. Attaching the report for your review. nvidia-bug-report.log.gz

herozyg avatar Jul 17 '23 07:07 herozyg

My understanding of this issue: swiotlb can currently allocate at most 256KB of contiguous memory; this is a limitation in swiotlb. The driver requested memory over that limit (e.g., 1MB), so the allocation failed and swiotlb reported "swiotlb buffer is full".

We may need some driver changes to support TDX VMs. For example, could the driver switch to dma_alloc_coherent()/dma_free_coherent() to allocate DMA buffers instead of the dma_map_*/dma_unmap_* family?

gaochaointel avatar Jul 17 '23 07:07 gaochaointel

Hi @herozyg ,

swiotlb buffer is full while sz / 4096 <= total - used might result from Linux not supporting > 512KB (or 256KB) contiguous memory, as @gaochaointel said.

You can try increasing this limit: change the value from 128 to 1024 at https://elixir.bootlin.com/linux/v6.2/source/include/linux/swiotlb.h#L25. Then recompile the guest kernel and boot.

Furthermore, try adding swiotlb=131072,force to qemu parameter -append to increase the size of swiotlb. (For example, -append "swiotlb=131072,force").

Notice: modifying the kernel or not using the default swiotlb size may hurt the performance.

Tan-YiFan avatar Aug 10 '23 12:08 Tan-YiFan

Hi @herozyg ,

swiotlb buffer is full while sz / 4096 <= total - used might result from Linux not supporting > 512KB (or 256KB) contiguous memory, as @gaochaointel said.

You can try increasing this limit: change the value from 128 to 1024 at https://elixir.bootlin.com/linux/v6.2/source/include/linux/swiotlb.h#L25. Then recompile the guest kernel and boot.

Furthermore, try adding swiotlb=131072,force to qemu parameter -append to increase the size of swiotlb. (For example, -append "swiotlb=131072,force").

Notice: modifying the kernel or not using the default swiotlb size may hurt the performance.

Thanks Yifan. Actually, I tried to update 128 to 1024, but still got the same error.

RodgerZhu avatar Aug 28 '23 14:08 RodgerZhu

@RodgerZhu A check for TDX VMs was added in 535.98.

Updating to the latest version of the NVIDIA driver may help.

Tan-YiFan avatar Aug 28 '23 14:08 Tan-YiFan

I think it would be useful to combine a CVM with non-CC GPUs. It may not be entirely safe, but it could be an option to make GPUs more widely usable. When I examined the code of the NVIDIA Open GPU Kernel Modules, I found that NVIDIA has implemented checks and processing for SEV, presumably decrypting the relevant memory. For example, in nv-vm.c, when unencrypted is set to true (which it should be inside SEV), all allocations go through dma_alloc_coherent, which should make the memory decrypted, and all mappings go through nv_adjust_pgprot, which likewise makes the memory decrypted. But when I use a 3090 with AMD SEV, the data comes back from the GPU as ciphertext. When I use SNP, I encounter the error Unsupported exit-code 0x404 in #VC exception, which seems to occur when memory is set as shared and pvalidate is called, resulting in the memory being invalidated. I would think decrypted memory shouldn't trigger a #VC exception. @Tan-YiFan Any suggestions?

wdsun1008 avatar Sep 07 '23 03:09 wdsun1008

@wdsun1008 I do not have access to a CVM+GPU setup, so I cannot reproduce this problem. Two guesses:

  1. Whether a page is private or shared is controlled by the C-bit of the stage-1 page table. dma_alloc_coherent makes the kernel-mode VA shared. However, CUDA uses user-level VAs, and the user page table is separate, so accessing the page in user mode might trigger #VC because the C-bit in the user-level page table remains 1 by default.
  2. The user-level instruction might issue a DMA request; DMA of private memory might cause #VC.

Tan-YiFan avatar Sep 07 '23 12:09 Tan-YiFan

Thanks for your reply. Is there any function to clear the user-space C-bit?

wdsun1008 avatar Sep 07 '23 14:09 wdsun1008

@wdsun1008

I am trying to implement clearing the user-space C-bit; I did not find an existing interface.

You can try executing some simple user-space code to locate the problem (https://github.com/AMDESE/AMDSEV/issues/177#issuecomment-1709645996)

Tan-YiFan avatar Sep 08 '23 01:09 Tan-YiFan

@wdsun1008

I am trying to implement clearing user-space C-bit. I did not find an existing interface.

You can try executing some simple user-space code to locate the problem (AMDESE/AMDSEV#177 (comment))

Here are some of my simple tests on SEV (without SNP; most tests cause #VC 404):

  1. malloc CPU memory and cudaMalloc GPU memory, cudaMemcpy to the device, then cudaMemcpy back to the host; the printed values are ciphertext.
  2. cudaMallocManaged UVM memory, then a CUDA kernel processes it; CUDA returns "an illegal memory access was encountered" with dmesg:
nvidia 0000:05:00.0: swiotlb buffer is full (sz: 2097152 bytes), total 524288 (slots), used 564 (slots)
[65000.453596] NVRM: GPU at PCI:0000:05:00: GPU-54ca9673-89a7-afd6-a37f-cb6c3c3f1f48
[65000.453605] NVRM: Xid (PCI:0000:05:00): 31, pid=19132, name=a.out, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7fc0_a2000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

swiotlb was adjusted to 1024MB

wdsun1008 avatar Sep 08 '23 04:09 wdsun1008

@wdsun1008 Test 1 shows that cudaMalloc and cudaMemcpy cause DMA from private memory. Test 2 shows that cudaMallocManaged makes use of swiotlb. I suggest adding some debug information in linux/kernel/dma/swiotlb.c to find out why the swiotlb buffer is full. Changing IO_TLB_SEGSIZE from 128 to 1024 works for me but fails for others. I believe fixing the swiotlb issue would make the UVM test pass.

Tan-YiFan avatar Sep 08 '23 07:09 Tan-YiFan

@Tan-YiFan dcu-patch Here is a patch from the Hygon DCU kernel, which implements a user-space decrypt function. It has no callers in the kernel code; maybe the function is meant to be called by a device driver to decrypt memory?

wdsun1008 avatar Sep 11 '23 12:09 wdsun1008

@wdsun1008 In this patch, __set_memory_enc_dec_user is almost the same as __set_memory_enc_dec except that the page-table pointer is passed as a function parameter. This function must be called from kernel space because it modifies the page table.

It could be called by a device driver; user space can use an ioctl to pass a user-space virtual address to this function.

Tan-YiFan avatar Sep 11 '23 13:09 Tan-YiFan

@Tan-YiFan I tried using a simple kernel module to perform user-space memory decryption, but the GPU computation still returns ciphertext. Here is my test code:

// memko.c
#include <linux/ioctl.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

#define IOCTL_MEM_DECRYPT _IOW('k', 1, unsigned long)

static long device_ioctl(struct file *file, unsigned int ioctl_num, unsigned long ioctl_param) 
{
    unsigned long user_addr;
    unsigned long user_size;

    switch (ioctl_num) {
        case IOCTL_MEM_DECRYPT:
            // Copy the address and size from user space
            if (copy_from_user(&user_addr, (unsigned long *)ioctl_param, sizeof(unsigned long)) != 0)
                return -EFAULT;
            
            if (copy_from_user(&user_size, (unsigned long *)(ioctl_param + sizeof(unsigned long)), sizeof(unsigned long)) != 0)
                return -EFAULT;

            printk(KERN_INFO "Received address: 0x%lx, size: %lu\n", user_addr, user_size);

            // Convert the size to number of pages
            unsigned long numberOfPages = user_size / PAGE_SIZE;
            if (user_size % PAGE_SIZE != 0) {
                ++numberOfPages;
            }

            // Obtain the current process's mm_struct
            struct mm_struct *mm = current->mm;

            int ret = set_memory_decrypted_userspace(user_addr, numberOfPages, mm);
            printk("decrypt %d\n", ret);
            return 0;
            break;

        default:
            return -ENOTTY;
    }

    return 0;
}

static struct file_operations fops = 
{
    .unlocked_ioctl = device_ioctl,
};

static int major;

static int __init memko_init(void)
{
    major = register_chrdev(0, "memko", &fops);
    if (major < 0) {
        printk("Registering the character device failed with %d\n", major);
        return major;
    }

    printk("The major number is %d.\n", major);
    return 0;
}

static void __exit memko_exit(void)
{
    /* must unregister the major we were assigned; 0 here would not release it */
    unregister_chrdev(major, "memko");
}

module_init(memko_init);
module_exit(memko_exit);

MODULE_LICENSE("GPL");
// test.cu
#include <stdio.h> 
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#define IOCTL_MEM_DECRYPT _IOW('k', 1, unsigned long)
__global__ void helloCUDA(int* a) 
{
     a[threadIdx.x] += 2;
}
void HANDLE_ERROR(cudaError_t cuda_error_code){
    if(cuda_error_code != cudaSuccess) 
        printf("[E] CUDA error: %s\n", cudaGetErrorString(cuda_error_code));
}

int decryptm(int fd, unsigned long addr, unsigned long size) {
    unsigned long args[2];
    args[0] = addr;
    args[1] = size;
    int retval = ioctl(fd, IOCTL_MEM_DECRYPT, args);
    if (retval == -1)
    {
        perror("ioctl error");
        exit(-1);
    }
    return retval;  /* was missing: decryptm is declared to return int */
}

int main() 
{ 
    int             *a, *dev_a;
    int             deviceId;
    int fd;
    int retval; 

    fd=open("/dev/memko", O_RDWR);  
    if(fd==-1)  
    {  
        perror("error open\n");  
        exit(-1);  
    }  
    printf("open /dev/memko successfully\n"); 

    HANDLE_ERROR(cudaGetDevice(&deviceId)); 

    a = (int*)malloc(10 * sizeof(*a));    
    retval = decryptm(fd, (unsigned long)a, 10 * sizeof(*a));
    if (retval != 0) {
        perror("error decrypt\n");  
        exit(-1); 
    }
    for(int i = 0; i < 10; i++) {
        a[i] = i;
        printf("%d,", a[i]);
    }
    
    HANDLE_ERROR(cudaMalloc((void**)&dev_a,
        10 * sizeof(*dev_a)));
    
    HANDLE_ERROR(cudaMemcpy(dev_a, a,
            10 * sizeof(*dev_a),
            cudaMemcpyHostToDevice));
    helloCUDA<<<1, 10>>>(dev_a);
    
    HANDLE_ERROR(cudaDeviceSynchronize());
    
    HANDLE_ERROR(cudaMemcpy(a, dev_a,
            10 * sizeof(*dev_a),
            cudaMemcpyDeviceToHost));
    
    for(int i = 0; i < 10; i++) {
        printf("%d,", a[i]);
    }
    return 0;
}

wdsun1008 avatar Sep 13 '23 06:09 wdsun1008

@wdsun1008 I am trying to check whether the host hypervisor (KVM) can read the plaintext of user-space data in the CVM. Thanks for your code.

Tan-YiFan avatar Sep 13 '23 06:09 Tan-YiFan

@wdsun1008 I'm sorry, I was not able to test it successfully. You can refer to https://github.com/AMDESE/AMDSEV/issues/185, which is a similar issue and has been handled.

Tan-YiFan avatar Oct 13 '23 01:10 Tan-YiFan

@wdsun1008 I'm sorry, I was not able to test it successfully. You can refer to AMDESE/AMDSEV#185, which is a similar issue and has been handled.

No worries, I haven't been successful either; it seems we might need to rely on the future Trusted Device/TEE-IO solution. I have taken note of that issue and will continue to monitor any progress related to it. If there are any updates, I will share them through the relevant issue.

wdsun1008 avatar Oct 13 '23 05:10 wdsun1008

We upgraded to the latest NVIDIA driver:

GPU: A10
CPU: Intel CPU w/ TDX
Installed the latest driver 535.129.03 in the TDVM.

lspci
02:00.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)

lsmod
Module                  Size  Used by
nvidia_modeset       1282048  0
nvidia_drm             16384  0
nvidia_uvm           1396736  0
nvidia              56565760  2 nvidia_uvm,nvidia_modeset

dmesg
[   60.988565] nvidia: loading out-of-tree module taints kernel.
[   60.988575] nvidia: module license 'NVIDIA' taints kernel.
[   60.988576] Disabling lock debugging due to kernel taint
[   61.195354] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   61.218186] nvidia-nvlink: Nvlink Core is being initialized, major device number 245

[   61.219818] ACPI: \_SB_.GSIG: Enabled at IRQ 22
[   61.219984] nvidia 0000:02:00.0: enabling device (0140 -> 0142)
[   62.083636] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.129.03  Thu Oct 19 18:56:32 UTC 2023
[   62.135673] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   62.139783] nvidia-uvm: Loaded the UVM driver, major device number 243.
[   62.176665] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.129.03  Thu Oct 19 18:42:12 UTC 2023

Run cmd "nvidia-smi"

No devices were found
dmesg
[   62.176665] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.129.03  Thu Oct 19 18:42:12 UTC 2023
[  162.297760] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2
[  162.310423] ACPI Warning: \_SB.PCI0.S30.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-61)
[  167.286673] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x25:0x65:1470)
[  167.287591] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[  167.303284] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2
[  172.022455] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x25:0x65:1470)
[  172.023363] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0

@Tan-YiFan any suggestions?

arronwy avatar Nov 17 '23 05:11 arronwy

@arronwy Here is some information extracted from your log:

[ 162.297760] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2

This firmware should be stored at /usr/lib/firmware/nvidia.

[ 167.286673] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x25:0x65:1470)

0x25 => RM_INIT_GPU_LOAD_FAILED, 0x65 => NV_ERR_TIMEOUT

Was the driver installed via NVIDIA-Linux-x86_64-535.129.03.run (or the CUDA installer) without the -m=kernel-open parameter? If so, I suggest installing the driver either with sh NVIDIA-Linux-x86_64-535.129.03.run -m=kernel-open, or by git cloning this repo, checking out the matching version, and running make modules -j $(nproc) followed by make modules_install.

Tan-YiFan avatar Nov 17 '23 07:11 Tan-YiFan

Thanks @Tan-YiFan, I rebuilt the kernel module with the -m=kernel-open parameter as you mentioned and made sure the firmware gsp_ga10x.bin exists in the guest OS, but it seems it still cannot find the firmware, and there is a new error message:

ls -alh /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin
-r--r--r-- 1 root root 37M Nov 17 07:40 /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin

md5sum /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin
baca3ef5eba805553186c9322c172fa1  /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin

[   42.842148] nvidia-uvm: Loaded the UVM driver, major device number 243.
[   43.722516] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.129.03  Release Build  (dvs-builder@U16-I3-B15-1-1)  Thu Oct 19 18:46:10 UTC 2023
[   59.137950] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2
[   59.137957] NVRM RmFetchGspRmImages: No firmware image found
[   59.137961] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x61:0x56:1594)
[   59.138755] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0

My driver build command:

./NVIDIA-Linux-x86_64-535.129.03.run -x && cd NVIDIA-Linux-x86_64-535.129.03
./nvidia-installer -a -q --ui=none \
 --no-cc-version-check \
 --no-opengl-files --no-install-libglvnd \
 -m=kernel-open \
 --kernel-source-path=

arronwy avatar Nov 17 '23 07:11 arronwy

@arronwy According to this line of log:

[ 59.137950] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2

It is at https://elixir.bootlin.com/linux/v6.6/source/drivers/base/firmware_loader/main.c#L905. The return value -2 is -ENOENT (#define ENOENT 2 /* No such file or directory */).

I suggest the following steps:

  • Checking: /lib should exist and symbolic link to /usr/lib.
  • Adding debug prints in the guest kernel near the function _request_firmware.

Tan-YiFan avatar Nov 17 '23 08:11 Tan-YiFan

Thanks @Tan-YiFan, I changed the firmware path to "/lib/firmware" and nvidia-smi works. deviceQuery also passes, but running other sample CUDA apps produces an error:

nvidia-smi
Fri Nov 17 08:12:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     Off | 00000000:02:00.0 Off |                    0 |
|  0%   52C    P0              59W / 150W |      4MiB / 23028MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A10"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 22516 MBytes (23609475072 bytes)
  (072) Multiprocessors, (128) CUDA Cores/MP:    9216 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             6251 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1
Result = PASS

./bf16TensorCoreGemm
CUDA error at ../../../Common/helper_cuda.h:888 code=801(cudaErrorNotSupported) "cudaSetDevice(devID)"
Initializing...

[  160.280094] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[  160.280099] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap
[  174.601818] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326

arronwy avatar Nov 17 '23 08:11 arronwy

@arronwy The error flag is NV_ERR_NOT_SUPPORTED, but I could not find which line of code sets it.

Below is my guess:

The "Cannot allocate sysmem through fb heap" message is at https://github.com/NVIDIA/open-gpu-kernel-modules/blob/535.129.03/src/nvidia/src/kernel/mem_mgr/system_mem.c#L226; nearby (at line 212) is code related to CVM:

    if ((sysGetStaticConfig(SYS_GET_INSTANCE()))->bOsCCEnabled &&
        gpuIsCCorApmFeatureEnabled(pGpu) &&
        FLD_TEST_DRF(OS32, _ATTR2, _MEMORY_PROTECTION, _UNPROTECTED,
                     pAllocData->attr2))
        {
            memdescSetFlag(pMemDesc, MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY,
                           NV_TRUE);
        }
  • bOsCCEnabled is equal to os_cc_enabled, which is set in nv_detect_conf_compute_platform and should be 1.
  • gpuIsCCorApmFeatureEnabled might be 0 because the GPU is not an H100.

To solve this issue, I would try hacking into the Nvidia kernel module:

  1. Which line of code sets the error flag NV_ERR_NOT_SUPPORTED?
  2. Is it related to the flag MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY?

What's more, using version 535.129.03 is not recommended. See https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html#known-issues (search for "confidential"). NVIDIA suggests 535.104.05.

Tan-YiFan avatar Nov 17 '23 09:11 Tan-YiFan

Thanks @Tan-YiFan, I tried 535.104.05 and it seems to have the same error:

nvidia-smi
Fri Nov 17 09:36:51 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     Off | 00000000:02:00.0 Off |                    0 |
|  0%   51C    P0              56W / 150W |      4MiB / 23028MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

./bf16TensorCoreGemm
CUDA error at ../../../Common/helper_cuda.h:888 code=801(cudaErrorNotSupported) "cudaSetDevice(devID)"
Initializing...

dmesg
[   36.591847] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   36.595762] nvidia-nvlink: Nvlink Core is being initialized, major device number 245

[   36.596639] ACPI: \_SB_.GSIG: Enabled at IRQ 22
[   36.596801] nvidia 0000:02:00.0: enabling device (0140 -> 0142)
[   37.449941] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.104.05  Release Build  (dvs-builder@U16-I2-C04-35-2)  Sat Aug 19 01:13:27 UTC 2023
[   37.499394] nvidia-uvm: Loaded the UVM driver, major device number 243.
[   37.562121] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.104.05  Release Build  (dvs-builder@U16-I2-C04-35-2)  Sat Aug 19 01:03:29 UTC 2023
[   49.331252] ACPI Warning: \_SB.PCI0.S30.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-61)
[  117.335047] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[  117.335051] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap
[  143.503077] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[  143.503081] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap

arronwy avatar Nov 17 '23 09:11 arronwy

Hi @Tan-YiFan, I rebuilt driver version 535.104.05 with the patch below:

git diff
diff --git a/src/nvidia/src/kernel/mem_mgr/system_mem.c b/src/nvidia/src/kernel/mem_mgr/system_mem.c
index 250dc400c8a0..6e67422bdf7e 100644
--- a/src/nvidia/src/kernel/mem_mgr/system_mem.c
+++ b/src/nvidia/src/kernel/mem_mgr/system_mem.c
@@ -209,14 +209,8 @@ sysmemConstruct_IMPL

     memdescSetFlag(pMemDesc, MEMDESC_FLAGS_SYSMEM_OWNED_BY_CLIENT, NV_TRUE);

-    if ((sysGetStaticConfig(SYS_GET_INSTANCE()))->bOsCCEnabled &&
-        gpuIsCCorApmFeatureEnabled(pGpu) &&
-        FLD_TEST_DRF(OS32, _ATTR2, _MEMORY_PROTECTION, _UNPROTECTED,
-                     pAllocData->attr2))
-        {
-            memdescSetFlag(pMemDesc, MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY,
+    memdescSetFlag(pMemDesc, MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY,
                            NV_TRUE);
-        }

     memdescSetGpuCacheAttrib(pMemDesc, gpuCacheAttrib);

@@ -224,7 +218,7 @@ sysmemConstruct_IMPL
     if (rmStatus != NV_OK)
     {
         NV_PRINTF(LEVEL_ERROR,
-                  "*** Cannot allocate sysmem through fb heap\n");
+                  "*** Cannot allocate sysmem through fb heap3\n");
         memdescFree(pMemDesc);
         memdescDestroy(pMemDesc);
         goto failed;

but I still get this error:

[ 1117.665051] nvidia-modeset: Unloading
[ 1117.668337] nvidia-uvm: Unloaded the UVM driver.
[ 1117.670403] nvidia-nvlink: Unregistered Nvlink Core, major device number 245
[ 1637.798500] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 1637.798507] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.104.05  Release Build  (root@localhost)  Fri Nov 17 11:12:31 UTC 2023
[ 1637.901119] nvidia-uvm: Loaded the UVM driver, major device number 243.
[ 1638.591124] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.104.05  Release Build  (root@localhost)  Fri Nov 17 11:08:44 UTC 2023
[ 1682.309982] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[ 1682.309985] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap3

arronwy avatar Nov 17 '23 11:11 arronwy

@arronwy I'm sorry but I could not solve this problem. I do not have access to TDX machines so I could not reproduce this problem. Checking the source of NV_ERR_NOT_SUPPORTED might help.

Tan-YiFan avatar Nov 18 '23 01:11 Tan-YiFan

@arronwy I'm sorry but I could not solve this problem. I do not have access to TDX machines so I could not reproduce this problem. Checking the source of NV_ERR_NOT_SUPPORTED might help.

Thanks @Tan-YiFan ,

I added the debug prints below:

diff --git a/src/nvidia/arch/nvalloc/unix/src/os.c b/src/nvidia/arch/nvalloc/unix/src/os.c
index bb03eac64e06..94ad9e4f3e08 100644
--- a/src/nvidia/arch/nvalloc/unix/src/os.c
+++ b/src/nvidia/arch/nvalloc/unix/src/os.c
@@ -923,6 +923,7 @@ NV_STATUS osAllocPagesInternal(
             memdescGetGuestId(pMemDesc),
             memdescGetPteArray(pMemDesc, AT_CPU),
             &pMemData);
+            NV_PRINTF(LEVEL_ERROR, "%s: osAllocPagesInternal MEMDESC_FLAGS_GUEST_ALLOCATED %d\n", __FUNCTION__, status);
     }
     else
     {
@@ -962,6 +963,8 @@ NV_STATUS osAllocPagesInternal(
                 nodeId,
                 memdescGetPteArray(pMemDesc, AT_CPU),
                 &pMemData);
+
+            NV_PRINTF(LEVEL_ERROR, "%s: osAllocPagesInternal unencrypted %d\n", __FUNCTION__, status);
         }

         if (nv && nv->force_dma32_alloc)

And dmesg shows:

dmesg|grep osAllocPagesInternal
[ 4368.348385] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.348583] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[... the same "osAllocPagesInternal unencrypted 0" line repeats ~40 more times ...]
[ 4371.896265] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.897869] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 86

Any suggestions?

arronwy avatar Nov 20 '23 09:11 arronwy

@arronwy osAllocPagesInternal calls nv_alloc_pages (in kernel-open/nvidia/nv.c). Debugging further into that function might help.
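For example, one could mirror the same instrumentation inside nv_alloc_pages() itself, so the failing call's size and flags become visible. A rough, untested sketch; the parameter names (`page_count`, `contiguous`, `unencrypted`) are assumptions from my reading of a 535-series tree and may differ in yours:

```diff
--- a/kernel-open/nvidia/nv.c
+++ b/kernel-open/nvidia/nv.c
@@ nv_alloc_pages
     /* ... existing allocation logic ... */
+    /* Debug sketch: report the failing request before returning.
+     * Parameter names here are illustrative -- check your tree. */
+    if (status != NV_OK)
+    {
+        nv_printf(NV_DBG_ERRORS,
+                  "NVRM: nv_alloc_pages failed: status=0x%x page_count=%u "
+                  "contiguous=%u unencrypted=%u\n",
+                  status, page_count, contiguous, unencrypted);
+    }
     return status;
```

That would show whether the NV_ERR_NOT_SUPPORTED comes from a specific combination (e.g. a large contiguous unencrypted request) under TDX.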

Tan-YiFan avatar Nov 20 '23 10:11 Tan-YiFan