CompreFace

Out of memory error on SubCenter-ArcFace-r100-gpu (Ubuntu 22.04, Nvidia GTX 1060 3GB)

Open martinenkoEduard opened this issue 2 years ago • 3 comments

It works for a while (and I must say it is blazingly fast), but after ~50 images it starts to drop images with this error:

face-api | compreface-ui | 172.20.0.1 - - [22/Jul/2022:21:38:25 +0000] "POST /api/v1/detection/detect?&face_plugins=calculator HTTP/1.1" 500 467 "-" "python-requests/2.25.1" compreface-core | {"severity": "CRITICAL", "message": "MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n", "request": {"method": "POST", "path": "/find_faces", "filename": "image.jpg", "api_key": "", "remote_addr": "172.20.0.4"}, "logger": "src.services.flask_.error_handling", "module": "error_handling", "traceback": "Traceback (most recent call last):\n File "/usr/local/lib/python3.7/dist-packages/flask/app.py", line 1950, in full_dispatch_request\n rv = self.dispatch_request()\n File "/usr/local/lib/python3.7/dist-packages/flask/app.py", line 1936, in dispatch_request\n return self.view_functionsrule.endpoint\n File "./src/services/flask_/needs_attached_file.py", line 32, in wrapper\n return f(*args, **kwargs)\n File "./src/_endpoints.py", line 72, in 
find_faces_post\n face_plugins=face_plugins\n File "./src/services/facescan/plugins/mixins.py", line 44, in call\n faces = self._fetch_faces(img, det_prob_threshold)\n File "./src/services/facescan/plugins/mixins.py", line 51, in _fetch_faces\n boxes = self.find_faces(img, det_prob_threshold)\n File "./src/services/facescan/plugins/insightface/insightface.py", line 83, in find_faces\n results = self.detection_model.get(img, det_thresh=det_prob_threshold)\n File "/usr/local/lib/python3.7/dist-packages/insightface/app/face_analysis.py", line 39, in get\n bboxes, landmarks = self.det_model.detect(img, threshold=det_thresh, scale = det_scale)\n File "/usr/local/lib/python3.7/dist-packages/insightface/model_zoo/face_detection.py", line 303, in detect\n scores = net_out[idx].asnumpy()\n File "/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy\n ctypes.c_size_t(data.size)))\n File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 253, in check_call\n raise MXNetError(py_str(LIB.MXGetLastError()))\nmxnet.base.MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 
0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n\n", "build_version": "dev"} compreface-api | 2022-07-22 21:38:25.481 ERROR 7 --- [nio-8080-exec-4] c.e.f.c.h.ResponseExceptionHandler : Defined exception occurred compreface-api | compreface-api | com.exadel.frs.commonservice.sdk.faces.exception.FacesServiceException: Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces] [FacesFeignClient#findFaces(MultipartFile,Integer,Double,String)]: [{"message":"MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: Map... 
(1133 bytes)] compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient.findFaces(FacesRestApiClient.java:34) compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient$$FastClassBySpringCGLIB$$517e8caf.invoke() compreface-api | at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) compreface-api | at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:687) compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient$$EnhancerBySpringCGLIB$$5f1e9a2e.findFaces() compreface-api | at com.exadel.frs.core.trainservice.service.FaceDetectionProcessServiceImpl.processImage(FaceDetectionProcessServiceImpl.java:31) compreface-api | at com.exadel.frs.core.trainservice.service.FaceDetectionProcessServiceImpl.processImage(FaceDetectionProcessServiceImpl.java:13) compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController.detect(DetectionController.java:71) compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController$$FastClassBySpringCGLIB$$6a25be2c.invoke() compreface-api | at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771) compreface-api | at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) compreface-api | at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:119) compreface-api | at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) compreface-api | at 
org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) compreface-api | at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691) compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController$$EnhancerBySpringCGLIB$$b1c0ae9e.detect() compreface-api | at jdk.internal.reflect.GeneratedMethodAccessor129.invoke(Unknown Source) compreface-api | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) compreface-api | at java.base/java.lang.reflect.Method.invoke(Unknown Source) compreface-api | at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190) compreface-api | at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138) compreface-api | at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:105) compreface-api | at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:878) compreface-api | at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:792) compreface-api | at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) compreface-api | at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1040) compreface-api | at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943) compreface-api | at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) compreface-api | at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909) compreface-api | at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:652) compreface-api | at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) compreface-api | at javax.servlet.http.HttpServlet.service(HttpServlet.java:733) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at com.exadel.frs.core.trainservice.filter.SecurityValidationFilter.doFilter(SecurityValidationFilter.java:124) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:93) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202) compreface-api | at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97) compreface-api | at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541) compreface-api | at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143) compreface-api | at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) compreface-api | at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78) compreface-api | at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) compreface-api | at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:374) compreface-api | at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) compreface-api | at 
org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868) compreface-api | at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590) compreface-api | at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) compreface-api | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) compreface-api | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) compreface-api | at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) compreface-api | at java.base/java.lang.Thread.run(Unknown Source) compreface-api | compreface-ui | 172.20.0.1 - - [22/Jul/2022:21:38:25 +0000] "POST /api/v1/detection/detect?&face_plugins=calculator HTTP/1.1" 500 467 "-" "python-requests/2.25.1" compreface-core | {"severity": "CRITICAL", "message": "MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 
0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n", "request": {"method": "POST", "path": "/find_faces", "filename": "image.jpg", "api_key": "", "remote_addr": "172.20.0.4"}, "logger": "src.services.flask.error_handling", "module": "error_handling", "traceback": "Traceback (most recent call last):\n File "/usr/local/lib/python3.7/dist-packages/flask/app.py", line 1950, in full_dispatch_request\n rv = self.dispatch_request()\n File "/usr/local/lib/python3.7/dist-packages/flask/app.py", line 1936, in dispatch_request\n return self.view_functionsrule.endpoint\n File "./src/services/flask/needs_attached_file.py", line 32, in wrapper\n return f(*args, **kwargs)\n File "./src/_endpoints.py", line 72, in find_faces_post\n face_plugins=face_plugins\n File "./src/services/facescan/plugins/mixins.py", line 44, in call\n faces = self._fetch_faces(img, det_prob_threshold)\n File "./src/services/facescan/plugins/mixins.py", line 51, in _fetch_faces\n boxes = self.find_faces(img, det_prob_threshold)\n File "./src/services/facescan/plugins/insightface/insightface.py", line 83, in find_faces\n results = 
self._detection_model.get(img, det_thresh=det_prob_threshold)\n File "/usr/local/lib/python3.7/dist-packages/insightface/app/face_analysis.py", line 39, in get\n bboxes, landmarks = self.det_model.detect(img, threshold=det_thresh, scale = det_scale)\n File "/usr/local/lib/python3.7/dist-packages/insightface/model_zoo/face_detection.py", line 303, in detect\n scores = net_out[idx].asnumpy()\n File "/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy\n ctypes.c_size_t(data.size)))\n File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 253, in check_call\n raise MXNetError(py_str(_LIB.MXGetLastError()))\nmxnet.base.MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n\n", "build_version": "dev"}

martinenkoEduard avatar Jul 22 '22 21:07 martinenkoEduard

I checked with watch -n0.1 nvidia-smi: it runs out of video memory. It seems the memory is never freed, because the video memory in use only ever increases.
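For anyone who wants to record this growth rather than watch it by eye, here is a minimal sketch that polls nvidia-smi and flags usage that only ever rises. The function names (parse_memory_used, query_memory_used, is_monotonic_growth, watch_gpu) and the sampling interval are my own assumptions for illustration, not part of CompreFace.

```python
import subprocess
import time
from typing import List


def parse_memory_used(csv_output: str) -> List[int]:
    """Parse one MiB value per GPU from nvidia-smi's csv,noheader,nounits output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def query_memory_used() -> List[int]:
    """Ask nvidia-smi for the memory currently in use on each GPU (MiB)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def is_monotonic_growth(samples: List[int], min_samples: int = 5) -> bool:
    """True if usage never dropped between samples and ended above where it started."""
    if len(samples) < min_samples:
        return False
    never_dropped = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_dropped and samples[-1] > samples[0]


def watch_gpu(num_samples: int = 60, interval_s: float = 1.0) -> List[int]:
    """Hypothetical helper: sample GPU 0 repeatedly and print each reading."""
    history: List[int] = []
    for _ in range(num_samples):
        history.append(query_memory_used()[0])
        print(f"GPU 0 memory used: {history[-1]} MiB")
        time.sleep(interval_s)
    if is_monotonic_growth(history):
        print("Memory use only increased during the sampling window.")
    return history
```

Calling watch_gpu() on the host while sending images should show whether usage plateaus after the models are loaded or keeps climbing until the out of memory error appears.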

martinenkoEduard avatar Jul 22 '22 21:07 martinenkoEduard

It is much more likely to happen if I check an image with several faces in it.

martinenkoEduard avatar Jul 24 '22 06:07 martinenkoEduard

In one of the threads, you asked about adding processes in Python. Each process loads the neural network onto the GPU and does not release the memory; releasing it would not make sense, because loading the network takes too long. The issue should not reproduce with a single process. So basically, the number of processes you can run is limited by GPU memory.
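The limit described above can be sketched as simple arithmetic. The per-process footprint and the reserve in this example are hypothetical numbers chosen only for illustration; the real footprint has to be measured (for example with nvidia-smi after a single process has loaded its models). The function name max_core_processes is likewise an assumption.

```python
def max_core_processes(total_mib: int, per_process_mib: int, reserve_mib: int = 0) -> int:
    """Estimate how many model-holding processes fit in GPU memory.

    total_mib:       total memory on the card
    per_process_mib: memory one process keeps allocated once its models are loaded
    reserve_mib:     headroom for inference buffers, the display, other programs
    """
    if per_process_mib <= 0:
        raise ValueError("per_process_mib must be positive")
    usable = total_mib - reserve_mib
    # Each process holds its own copy of the network, so the footprints add up.
    return max(usable // per_process_mib, 0)


# Hypothetical figures: a 3 GB card (3072 MiB), 1400 MiB per process, 200 MiB reserve.
print(max_core_processes(3072, 1400, 200))
```

With these made-up figures two processes would fit, but with little headroom left for per-request buffers, which would be consistent with errors appearing only on larger inputs.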

pospielov avatar Jul 25 '22 13:07 pospielov

I have the same problem. Configured with two processes and one thread, the GPU memory in use sometimes only increases.

In one of the threads, you asked about adding processes in Python. Each process loads the neural network to GPU and it doesn't release the memory. It doesn't make sense to release the memory as it takes too much time to load NN to it. It shouldn't reproduce with one process. So basically, you are limited with the number of processes by GPU memory.

allen20200111 avatar Jan 28 '23 05:01 allen20200111

I created a bug to investigate. I'm not sure we will be able to fix it, as we use the InsightFace library as is, without changes under the hood.

pospielov avatar Feb 02 '23 17:02 pospielov