libSGM icon indicating copy to clipboard operation
libSGM copied to clipboard

CUDA ERROR #4 (using FilterMedian kernel #2): unspecified launch failure

Open KeoChi opened this issue 7 years ago • 4 comments

Hi, it is normal when I execute this program ./stereo_test on PC. But there is an error on Nvidia TX2. CUDA ERROR #4 (using FilterMedian kernel #2): unspecified launch failure stereo_test: /home/nvidia/newlibSGM/libSGM/src/median_filter.cu:31: void sgm::details::median_filter(const uint16_t*, uint16_t*, void*, int, int): Assertion status == 0' failed. Aborted (core dumped)`
This mistake confused me for a long time. Looking forward to your reply.

KeoChi avatar Jan 29 '18 03:01 KeoChi

We may possibly mistake how to use NPP functions. In haste, try using following patch that stop using NPP. (I haven't check well.)

diff --git a/sample/image/CMakeLists.txt b/sample/image/CMakeLists.txt
index 56c03d2..a61c287 100644
--- a/sample/image/CMakeLists.txt
+++ b/sample/image/CMakeLists.txt
@@ -15,10 +15,5 @@ include_directories(../../include)
 
 CUDA_ADD_EXECUTABLE(stereo_test stereosgm_image.cpp ${CUDA_SRC})
 
-if(CUDA_VERSION VERSION_LESS "9.0")
-	set(NPP_LIBRARIES "${CUDA_nppi_LIBRARY}")
-else()
-	set(NPP_LIBRARIES "${CUDA_nppif_LIBRARY}")
-endif()
 
-TARGET_LINK_LIBRARIES(stereo_test sgm ${CUDA_LIBRARIES} ${NPP_LIBRARIES} ${OpenCV_LIBS})
+TARGET_LINK_LIBRARIES(stereo_test sgm ${CUDA_LIBRARIES} ${OpenCV_LIBS})
diff --git a/sample/image/stereosgm_image.cpp b/sample/image/stereosgm_image.cpp
index 4416194..87defff 100644
--- a/sample/image/stereosgm_image.cpp
+++ b/sample/image/stereosgm_image.cpp
@@ -61,7 +61,6 @@ int main(int argc, char* argv[]) {
 	cv::Mat output(cv::Size(left.cols, left.rows), CV_8UC1);
 
 	ssgm.execute(left.data, right.data, (void**)&output.data);
-
 	// show image
 	cv::imshow("image", output * 256 / disp_size);
 	
diff --git a/sample/movie/CMakeLists.txt b/sample/movie/CMakeLists.txt
index f15f528..ac67d4d 100644
--- a/sample/movie/CMakeLists.txt
+++ b/sample/movie/CMakeLists.txt
@@ -34,10 +34,5 @@ renderer.cpp
 cuda_gl.cu
 )
 
-if(CUDA_VERSION VERSION_LESS "9.0")
-	set(NPP_LIBRARIES "${CUDA_nppi_LIBRARY}")
-else()
-	set(NPP_LIBRARIES "${CUDA_nppif_LIBRARY}")
-endif()
 
-TARGET_LINK_LIBRARIES(stereo_movie sgm  ${CUDA_LIBRARIES} ${NPP_LIBRARIES} ${OpenCV_LIBS} ${GLEW_LIBRARIES} ${GLFW_LIBRARY} ${OPENGL_LIBRARY})
+TARGET_LINK_LIBRARIES(stereo_movie sgm  ${CUDA_LIBRARIES} ${OpenCV_LIBS} ${GLEW_LIBRARIES} ${GLFW_LIBRARY} ${OPENGL_LIBRARY})
diff --git a/sample/zed/CMakeLists.txt b/sample/zed/CMakeLists.txt
index 30330cd..141ec90 100644
--- a/sample/zed/CMakeLists.txt
+++ b/sample/zed/CMakeLists.txt
@@ -32,11 +32,6 @@ if(NOT WIN32)
 	set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -std=c++11")
 endif()
 
-if(CUDA_VERSION VERSION_LESS "9.0")
-	set(NPP_LIBRARIES "${CUDA_nppi_LIBRARY}")
-else()
-	set(NPP_LIBRARIES "${CUDA_nppif_LIBRARY}")
-endif()
 
 CUDA_ADD_EXECUTABLE(zed_demo
 	zed_demo.cpp
@@ -52,5 +47,4 @@ TARGET_LINK_LIBRARIES(zed_demo
 	${OpenCV_LIBS} 
 	${GLEW_LIBRARIES} ${GLFW_LIBRARY} ${OPENGL_LIBRARY}
 	${ZED_SDK_LIB}
-	${NPP_LIBRARIES}
 )
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 59647cf..d34a81e 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -4,11 +4,6 @@ include_directories(../include)
 
 find_package(CUDA REQUIRED)
 
-if(CUDA_VERSION VERSION_LESS "9.0")
-	set(NPP_LIBRARIES "${CUDA_nppi_LIBRARY}")
-else()
-	set(NPP_LIBRARIES "${CUDA_nppif_LIBRARY}")
-endif()
 
 SET(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} ${CUDA_ARCH}")
 
@@ -24,7 +19,7 @@ SET(STEREOSRCS
 
 if(LIBSGM_SHARED)
 	CUDA_ADD_LIBRARY(sgm stereo_sgm.cpp ${STEREOSRCS} SHARED)
-	target_link_libraries(sgm ${CUDA_LIBRARIES} ${NPP_LIBRARIES})
+	target_link_libraries(sgm ${CUDA_LIBRARIES})
 else()
 	CUDA_ADD_LIBRARY(sgm stereo_sgm.cpp ${STEREOSRCS} STATIC)
 endif()
diff --git a/src/internal.h b/src/internal.h
index 33cd91d..8ba03f4 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -41,7 +41,7 @@ namespace sgm {
 
 		void winner_takes_all(const uint16_t* d_scost, uint16_t* d_left_disp, uint16_t* d_right_disp, int width, int height, int disp_size);
 		
-		void median_filter(const uint16_t* d_src, uint16_t* d_dst, void* median_filter_buffer, int width, int height);
+		void median_filter(const uint16_t* d_src, uint16_t* d_dst, int width, int height);
 
 		void check_consistency(uint16_t* d_left_disp, const uint16_t* d_right_disp, const void* d_src_left, int width, int height, int depth_bits);
 
diff --git a/src/median_filter.cu b/src/median_filter.cu
index a4824bd..017e124 100644
--- a/src/median_filter.cu
+++ b/src/median_filter.cu
@@ -14,21 +14,152 @@ See the License for the specific language governing permissions and
 limitations under the License.
 */
 
-#include <nppi.h>
-
 #include "internal.h"
 
+namespace {
+	__global__
+		void omwc_gpu(const ushort2* src, uint16_t* dst, int width, int height) {
+		int index, xinG, yinG, i;
+		xinG = (blockDim.x * blockIdx.x + threadIdx.x) * 2 + 1;
+		yinG = blockDim.y * blockIdx.y + threadIdx.y;
+
+		if (xinG < width && yinG < height) {
+			index = width * yinG + xinG;
+			if (xinG == width - 1 || yinG == 0 || yinG == width - 1){
+				dst[index]     = src[index / 2].y;
+				dst[index + 1] = src[(index + 1) / 2].x;
+			}
+			else if (yinG == width - 1) {
+				dst[index + 1] = src[index / 2].y;
+			}
+			else {
+				uint32_t arrL[9];
+				uint32_t arrR[9];
+				ushort2 arrTemp[6];
+
+				arrTemp[0] = src[(index - width - 1) / 2];
+				arrTemp[1] = src[(index - width + 1) / 2];
+				arrTemp[2] = src[(index - 1) / 2];
+				arrTemp[3] = src[(index + 1) / 2];
+				arrTemp[4] = src[(index + width - 1) / 2];
+				arrTemp[5] = src[(index + width + 1) / 2];
+
+				arrL[0] = arrTemp[0].x;
+				arrL[1] = arrTemp[0].y;
+				arrR[0] = arrTemp[0].y;
+				arrL[2] = arrTemp[1].x;
+				arrR[1] = arrTemp[1].x;
+				arrR[2] = arrTemp[1].y;
+				arrL[3] = arrTemp[2].x;
+				arrL[4] = arrTemp[2].y;
+				arrR[3] = arrTemp[2].y;
+				arrL[5] = arrTemp[3].x;
+				arrR[4] = arrTemp[3].x;
+				arrR[5] = arrTemp[3].y;
+				arrL[6] = arrTemp[4].x;
+				arrL[7] = arrTemp[4].y;
+				arrR[6] = arrTemp[4].y;
+				arrL[8] = arrTemp[5].x;
+				arrR[7] = arrTemp[5].x;
+				arrR[8] = arrTemp[5].y;
+
+				uint32_t tempR, tempL;
+#pragma unroll
+				for (i = 1; i < 6; i++){
+					if (arrL[0] > arrL[i]){
+						tempL = arrL[0];
+						arrL[0] = arrL[i];
+						arrL[i] = tempL;
+					}
+					if (arrR[0] > arrR[i]){
+						tempR = arrR[0];
+						arrR[0] = arrR[i];
+						arrR[i] = tempR;
+					}
+				}
+#pragma unroll
+				for (i = 2; i < 6; i++){
+					if (arrL[1] < arrL[i]){
+						tempL = arrL[1];
+						arrL[1] = arrL[i];
+						arrL[i] = tempL;
+					}
+					if (arrR[1] < arrR[i]){
+						tempR = arrR[1];
+						arrR[1] = arrR[i];
+						arrR[i] = tempR;
+					}
+				}
+#pragma unroll
+				for (int i = 3; i < 7; i++){
+					if (arrL[2]>arrL[i]){
+						tempL = arrL[2];
+						arrL[2] = arrL[i];
+						arrL[i] = tempL;
+					}
+					if (arrR[2]>arrR[i]){
+						tempR = arrR[2];
+						arrR[2] = arrR[i];
+						arrR[i] = tempR;
+					}
+				}
+#pragma unroll
+				for (int i = 4; i < 7; i++){
+					if (arrL[3]<arrL[i]){
+						tempL = arrL[3];
+						arrL[3] = arrL[i];
+						arrL[i] = tempL;
+					}
+					if (arrR[3]<arrR[i]){
+						tempR = arrR[3];
+						arrR[3] = arrR[i];
+						arrR[i] = tempR;
+					}
+				}
+#pragma unroll
+				for (int i = 5; i < 8; i++){
+					if (arrL[4]>arrL[i]){
+						tempL = arrL[4];
+						arrL[4] = arrL[i];
+						arrL[i] = tempL;
+					}
+					if (arrR[4]>arrR[i]){
+						tempR = arrR[4];
+						arrR[4] = arrR[i];
+						arrR[i] = tempR;
+					}
+				}
+#pragma unroll
+				for (int i = 6; i < 8; i++){
+					if (arrL[5]<arrL[i]){
+						tempL = arrL[5];
+						arrL[5] = arrL[i];
+						arrL[i] = tempL;
+					}
+					if (arrR[5]<arrR[i]){
+						tempR = arrR[5];
+						arrR[5] = arrR[i];
+						arrR[i] = tempR;
+					}
+				}
+				dst[index]     = max(min(arrL[6], arrL[7]), min(max(arrL[6], arrL[7]), arrL[8]));
+				dst[index + 1] = max(min(arrR[6], arrR[7]), min(max(arrR[6], arrR[7]), arrR[8]));
+			}
+		}
+	}
+}
+
+
 namespace sgm {
 	namespace details {
 
-		void median_filter(const uint16_t* d_src, uint16_t* d_dst, void* median_filter_buffer, int width, int height) {
-			NppiSize roi = { width, height };
-			NppiSize mask = { 3, 3 };
-			NppiPoint anchor = { 0, 0 };
+		void median_filter(const uint16_t* d_src, uint16_t* d_dst, int width, int height) {
+			const int numthread_side_x = 64;
+			const int numthread_side_y = 16;
 
-			NppStatus status = nppiFilterMedian_16u_C1R(d_src, sizeof(Npp16u) * width, d_dst, sizeof(Npp16u) * width, roi, mask, anchor, (Npp8u*)median_filter_buffer);
-			
-			assert(status == 0);
+			dim3 numBlocks((width + numthread_side_x - 1) / numthread_side_x, (height + numthread_side_y - 1) / numthread_side_y, 1);
+			dim3 numThread(numthread_side_x / 2, numthread_side_y, 1);
+			omwc_gpu << < numBlocks, numThread >> >(reinterpret_cast<const ushort2*>(d_src), d_dst, width, height);
 		}
 
 	}
diff --git a/src/stereo_sgm.cpp b/src/stereo_sgm.cpp
index c6e6b75..1e41382 100644
--- a/src/stereo_sgm.cpp
+++ b/src/stereo_sgm.cpp
@@ -16,8 +16,6 @@ limitations under the License.
 
 #include <iostream>
 
-#include <nppi.h>
-
 #include <libsgm.h>
 
 #include "internal.h"
@@ -41,7 +39,6 @@ namespace sgm {
 
 		cudaStream_t cuda_streams[8];
 
-		Npp32u median_buffer_size;
 		void* d_median_filter_buffer;
 
 		void* d_output_16bit_buffer;
@@ -76,15 +73,6 @@ namespace sgm {
 				CudaSafeCall(cudaStreamCreate(&this->cuda_streams[i]));
 			}
 
-			NppiSize roi = { width_, height_ };
-			NppiSize mask = { 3, 3 }; // width, height
-			NppStatus status;
-			status = nppiFilterMedianGetBufferSize_16u_C1R(roi, mask, &this->median_buffer_size);
-			if (status != 0) {
-				throw std::runtime_error("nppi error");
-			}
-			CudaSafeCall(cudaMalloc(&this->d_median_filter_buffer, this->median_buffer_size));
-
 			// create temporary buffer when dst type is 8bit host pointer
 			if (!is_cuda_output(inout_type_) && output_depth_bits_ == 8) {
 				this->h_output_16bit_buffer = (uint16_t*)malloc(sizeof(uint16_t) * width_ * height_);
@@ -184,8 +172,8 @@ namespace sgm {
 
 		sgm::details::winner_takes_all((const uint16_t*)cu_res_->d_scost, (uint16_t*)cu_res_->d_left_disp, (uint16_t*)cu_res_->d_right_disp, width_, height_, disparity_size_);
 
-		sgm::details::median_filter((uint16_t*)cu_res_->d_left_disp, (uint16_t*)cu_res_->d_tmp_left_disp, cu_res_->d_median_filter_buffer, width_, height_);
-		sgm::details::median_filter((uint16_t*)cu_res_->d_right_disp, (uint16_t*)cu_res_->d_tmp_right_disp, cu_res_->d_median_filter_buffer, width_, height_);
+		sgm::details::median_filter((uint16_t*)cu_res_->d_left_disp, (uint16_t*)cu_res_->d_tmp_left_disp, width_, height_);
+		sgm::details::median_filter((uint16_t*)cu_res_->d_right_disp, (uint16_t*)cu_res_->d_tmp_right_disp, width_, height_);
 
 		sgm::details::check_consistency((uint16_t*)cu_res_->d_tmp_left_disp, (uint16_t*)cu_res_->d_tmp_right_disp, d_input_left, width_, height_, input_depth_bits_);

ykitta-fixstars avatar Jan 30 '18 05:01 ykitta-fixstars

Thank you for your reply. I changed another development board TX2 to test, and the code has run successfully

KeoChi avatar Jan 30 '18 11:01 KeoChi

Hello, I'd like to elaborate regarding:

We may possibly mistake how to use NPP functions.

I think this may very well be the case indeed. To be fair, the NPP docs could be more helpful. I wrote a little patch to demonstrate a possibly more correct usage, see this commit. The things to look for are:

  • Image ROI: with the current settings, the algorithm is going to read pixels two columns and two rows too far.
  • The kernel origin is not at the center. This may or may not be what you want, I'm thinking not.
  • I shifted the image pointers one row down and one column right because of the change in kernel origin.

This "fix" is not a proper fix though, all it does is specify a region that fits in the source image, at the cost of leaving the bordel pixels empty. The proper solution for this would be to add padding to the image, in that case, the nppiCopyReplicateBorder_16u_C1R should be what you need. See this commit for an example implementation.

Tell me if you'd like me to open a PR.

Robzz avatar Feb 16 '18 12:02 Robzz

Thank you for pointing out! You are right. The current code will cause out-of-range access. We used NPP library incorrectly. Although this library is flexible, but slow in specific case. For example, Since attaching border requires extra memory access, the performance will decrease. Because we value speed, we plan to replace NPP with homemade (above patch) median filter.

ykitta-fixstars avatar Feb 21 '18 01:02 ykitta-fixstars