popsift icon indicating copy to clipboard operation
popsift copied to clipboard

Gauss kernel initialization: unknown error

Open fabiencastan opened this issue 5 years ago • 11 comments

[17:41:52.100709][info] CUDA-Enabled GPU.
Device information:
	- id:                      0
	- name:                    Quadro K4000
	- compute capability:      3.0
	- total device memory:     3071 MB 
	- device memory available: 1303 MB 
	- per-block shared memory: 49152
	- warp size:               32
	- max threads per block:   1024
	- max threads per SM(X):   2048
	- max block sizes:         {1024,1024,64}
	- max grid sizes:          {2147483647,65535,65535}
	- max 2D array texture:    {65536,65536}
	- max 3D array texture:    {4096,4096,4096}
	- max 2D linear texture:   {65000,65000,1048544}
	- max 2D layered texture:  {16384,16384,2048}
	- number of SM(x)s:        4
	- registers per SM(x):     65536
	- registers per block:     65536
	- concurrent kernels:      yes
	- mapping host memory:     yes
	- unified addressing:      yes
	- texture alignment:       512 byte
	- pitch alignment:         32 byte

[17:41:52.176080][info] Supported CUDA-Enabled GPU detected.
[17:41:53.197578][info] Extracting sift features from view '/xxxx/footage/LionWood/IMG_3678.JPG' [gpu]
Choosing device 0: Quadro K4000
/xxxx/popsift/develop/repo/src/popsift/gauss_filter.cu:245
    cudaMemcpyToSymbol failed for Gauss kernel initialization: unknown error

Does anyone have any idea where this error cames from?

fabiencastan avatar Dec 10 '18 16:12 fabiencastan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 30 '20 10:04 stale[bot]

I sometimes get a similar error on an MBP when there are other programs running using GPU (typically Chrome). Closing the program(s) makes the system to switch to the integrated graphic card and then I am able use popsift. Probably a memory size limit?

simogasp avatar May 11 '20 15:05 simogasp

I get the same error:

$ ./popsift-demo -i image.png --print-dev-info
PopSift version: 1.0.0
image.png
Choosing device 0: GeForce RTX 2070
Device information:
    Name: GeForce RTX 2070
    Compute Capability:    7.5
    Total device mem:      8366915584 B 8170816 kB 7979 MB
    Per-block shared mem:  49152
    Warp size:             32
    Max threads per block: 1024
    Max threads per SM(X): 1024
    Max block sizes:       {1024,1024,64}
    Max grid sizes:        {2147483647,65535,65535}
    Number of SM(x)s:      36
    Concurrent kernels:    yes
    Mapping host memory:   yes
    Unified addressing:    yes

/code/my_projects/popsift/src/popsift/gauss_filter.cu:245
    cudaMemcpyToSymbol failed for Gauss kernel initialization: invalid device symbol

According to nvidia-smi, only 600 MiB out of 8000 MiB on the GPU are occupied by other processes.

taketwo avatar May 12 '20 11:05 taketwo

thanks for the report. I guess we need to call in the big guns :-) @griwodz

simogasp avatar May 12 '20 11:05 simogasp

Just a few more data points. Out of curiosity, I commented out that memory copy. This lead to a similar error in uploading SIFT constants:

/code/my_projects/popsift/src/popsift/sift_constants.cu:52
    Failed to upload h_consts to device: invalid device symbol

Commenting this one out, I hit the next one at:

/code/my_projects/popsift/src/popsift/common/debug_macros.cu:24
    called from /code/my_projects/popsift/src/popsift/s_pyramid_build.cu:125
    cudaGetLastError failed: invalid device function

taketwo avatar May 12 '20 12:05 taketwo

Huh. You are getting different error messages on the K4000 and the RTX 2070. That's weird.

Could you try to move the __device__ __constant__ GaussInfo d_gauss; in gauss_filter.cu out of the namespace popsift? Since the binding is symbolic, it is possible that something has changed and the namespace is now a problem.

Another possibility, but I wouldn't know why that should happen if your system has only 1 CUDA card, is that cudaMemcpyToSymbol cannot figure out which card you are trying to use. The constant memory should exist on all CUDA cards anyway. That could be tested by adding a call cudaSetDevice(0); at the top of the init_filter function (that would be just for testing, not a solution in the long term).

griwodz avatar May 12 '20 17:05 griwodz

The amount on constant memory on a CUDA card is quite limited, but all documentation insists that it is because the constant cache size is limited.

Do you have any hints on how I can get recreate the error (on Linux)?

griwodz avatar May 12 '20 17:05 griwodz

Hi, thanks for your answer. I've tried both moving d_gauss out of the popsift namespace and setting device explicitly, all to no avail.

I did not mention before, I am running PopSift in a Docker container. This morning I tried to build and run it on the host system directly, and there were no issues.

Do you have any hints on how I can get recreate the error (on Linux)?

Unfortunately, the Docker image I use is proprietary, so I can not share it. Instead, I've tried to create a minimal image based on nvidia/cuda with the same Ubuntu/CUDA version. To my surprise, when I compile and run PopSift there, it also has no issues. So, apparently, there is something very special about my proprietary image. I'm investigating further, but if you have any ideas or hints what can be tried, please let me know.

taketwo avatar May 13 '20 09:05 taketwo

Is it possible that your main Docker container uses a different CUDA SDK than the host machine, but your test container uses the same SDK as the host?

Since late CUDA 10, NVidia tries to do something about the compatibility hassle (as they are writing here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html), but I have not looked at those compatibility libraries at all.

griwodz avatar May 13 '20 10:05 griwodz

My "main" container is based on the nvidia/cudagl:10.0-devel-ubuntu18.04 image; I used the same one for my "test" container. So both of them have CUDA 10.0. On my host system I used to have 10.2, however this morning I downgraded it to 10.0 just in case. nvidia-smi still reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+

but as far as I understand, this is because the driver is tied to a certain CUDA version regardless of which CUDA SDK is actually installed. Also, according to the info on the page you posted, this driver is compatible with all 10.x versions.

I'm currently trying to "bisect" the layers of the "main" container to find the one that introduces the problem.

taketwo avatar May 13 '20 10:05 taketwo

I found the cause and it (seemingly) has nothing to do with CUDA and/or Docker. In my "main" container lld linker is installed and setup to be used by default. That's all. Switching back to the standard gold linker eliminates the issue.

In case you want to reproduce this and check what's going on, simply install lld package and add the following to the root CMakeLists.txt of PopSift:

set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fuse-ld=lld")

taketwo avatar May 13 '20 14:05 taketwo