visit icon indicating copy to clipboard operation
visit copied to clipboard

Non-screen capture save window in client/server crashes engine

Open markcmiller86 opened this issue 3 years ago • 9 comments

Describe the bug

This is happening frequently for a VIP LLNL user.

Am attaching the whole set of debug logs here but also writing excerpt highly suggestive the issue is related to IceT because the failure case seems to never return from ConvertIceTImageToAVTImage (or never gets into it).

Excerpt from A.engine_par.000.5.vlog for a WORKING save...

Executing SetWinAnnotAttsRPC 2048x2048
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing RenderRPC for the following plots
   0, 1, 2,
NetworkManager::RenderSetup: annotMode=2
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0000".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0001".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0002".
GetSize: 2048, 2048
0: viewport: x=195, y=151, w=1853, h=1453
ConvertIceTImageToAVTImage: w=2048, h=2048, keepZ=0, keepA=0
ConvertIceTImageToAVTImage: IceTImage: rgba_ubyte=1, z=0
ConvertIceTImageToAVTImage: Copying image data.
ConvertIceTImageToAVTImage: Not reading back zbuffer data
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Engine::GatherData:
  writer->MustMergeParallelStreams()=false
  useCompression=false
  respondWithNull=false
  scalableThreshold=-1
  currentTotalGlobalCellCount=0
  cellCountMultiplier=1
exceeded scalable threshold of -1
sending 12583985 bytes to the viewer 73 from strings.
Number of actual direct writes = 2

Excerpt from A.engine_par.000.5.vlog for a FAILED save...

Executing SetWinAnnotAttsRPC 8192x8192
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing RenderRPC for the following plots
   0, 1, 2,
NetworkManager::RenderSetup: annotMode=2
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0000".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0001".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0002".
GetSize: 8192, 8192
0: viewport: x=779, y=603, w=7413, h=5813
signalhandler_core: SIGBRT!

To Reproduce

Steps to reproduce the behavior. For example:

  1. Run client server from macOS or Linux (I haven't tried Windows) to Pascal
  2. Open /usr/gapps/visit/data/multi_ucd3d.silo
  3. Select parallel pvis with 2 nodes and 36 processors
  4. Put up Mesh Plot, Filled Boundary Plot and Subset plot of domains
  5. Turn of material 1 and draw
  6. Now try a flurry of different saves from the Set save options window adjusting either width, which aspect ratio, whether to use screen capture or not. I suspect it will fail only in non-screen-capture mode. In my tests, often tried really, really large widths like 8192 and even 16384 (which saves all black except for legen) and although those large saves did indeed work, the engine would later crash.

Expected behavior

Engine should save a window and not crash.

Attachments

Desktop

  • OS and version: macOS and Linux
  • VisIt Version: 3.2.2 in both cases

markcmiller86 avatar Oct 06 '22 19:10 markcmiller86

I went to the VIPs office and she showed me a consistent reproducer. I've been able to reproduce with multi_ucd3d.silo running with 36 cores on a node of quartz. This is with VisIt 3.2.2.

  1. Start VisIt (visit -v 3.2.2 -debug 5)
  2. Go to a 2x2 window setup
  3. Set scalable rendering to always
  4. Open /usr/gapps/visit/data/multi_ucd3d.silo
  5. Select a 36 node engine in pdebug
  6. Plot a filled boundary plot
  7. Go to window 2
  8. Click on Draw to draw the same filled boundary plot in the second window
  9. Click on Save window (this saves roughly a 1024x1024 image, the width is 1024, but the height isn't since the window isn't quite square)
  10. Got to window 1
  11. Click on Save window

The engine crashes.

A secondary bug, is that if you restart a new engine, it now fails to save the image because of a pipeline usage error. You actually have to clear the window, draw the plots again and then save the image.

I have attached a log file from processor 0.

A.engine_par.000.5.vlog.txt

brugger1 avatar Oct 07 '22 21:10 brugger1

Looking at the log file, for the successful save, the save starts with

Executing SetWinAnnotAttsRPC 1024x854
Found matching annotation attributes for legend Plot0001

For the crash with the second save we get

Executing SetWinAnnotAttsRPC 1024x854
signalhandler_core: SIGSEGV!

I suspect the crash may be happening just before the message

Found matching annotation attributes for legend Plot0001

is output.

brugger1 avatar Oct 07 '22 21:10 brugger1

The failure happens in exactly the same manner with 3.3.1. So we can probably debug this with the 3.3RC branch.

brugger1 avatar Oct 07 '22 21:10 brugger1

Interesting! We have had recent conversations where we have a bit of a funky paradigm for matching annotations with plots. So, maybe that design or associated logic is the culprit.

markcmiller86 avatar Oct 07 '22 21:10 markcmiller86

Interesting thought, now that you mention it.

brugger1 avatar Oct 07 '22 21:10 brugger1

It's crashing in the function

void
NetworkManager::SetWindowAttributes(EngineVisWinInfo &viswinInfo,
    const WindowAttributes &atts, const std::string& extstr,
    const double *vexts, const std::string& ctName)

in the file ./engine/main/NetworkManager.C in the code block starting at line 3233.

brugger1 avatar Oct 07 '22 22:10 brugger1

Hmmm. On both develop and 3.3RC that ref takes me here...

https://github.com/visit-dav/visit/blob/56636a80fd5acc2620285d2d3e66eda6b3d76c99/src/engine/main/NetworkManager.C#L3233-L3243

Is that where you mean?

markcmiller86 avatar Oct 07 '22 23:10 markcmiller86

Off hand, I don't see much in there that could fail except a bad viswin pointer.

markcmiller86 avatar Oct 07 '22 23:10 markcmiller86

That's the correct block of code. I agree, I don't see much that can go wrong there. I'll post more info when I find the exact cause.

brugger1 avatar Oct 10 '22 16:10 brugger1

After experimenting more with it I found that it also crashes when using a 1x2 window layout with a serial engine. That's a lot easier to debug.

brugger1 avatar Oct 28 '22 22:10 brugger1

I looked into the how the memory that caused the engine to crash when it was freed was used and couldn't find anything wrong, so I concluded that it was some memory corruption that occurred before the free and had nothing to do with the actual memory being freed.

I then ran valgrind on the engine and got the following output, which points to a memory overwrite down in mesa. Here is the output.

==737== Invalid write of size 8
==737==    at 0x4C2E8A3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1033)
==737==    by 0x19070E1D: osmesa_st_framebuffer_flush_front (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737==    by 0x18E38E25: st_manager_flush_frontbuffer (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737==    by 0x18C69737: _mesa_make_current (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737==    by 0x18E38BEC: st_api_make_current (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737==    by 0x190708D4: OSMesaMakeCurrent (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737==    by 0x8952B32: vtkOSOpenGLRenderWindow::MakeCurrent() (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737==    by 0x88C47CC: vtkOpenGLResourceFreeCallback<vtkOpenGLPolyDataMapper>::Release() (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737==    by 0x88DD239: vtkOpenGLRenderWindow::ReleaseGraphicsResources(vtkRenderWindow*) (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737==    by 0x8951D81: vtkOSOpenGLRenderWindow::DestroyOffScreenWindow() (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737==    by 0x895263D: vtkOSOpenGLRenderWindow::ResizeOffScreenWindow(int, int) (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737==    by 0x8952E0F: vtkOSOpenGLRenderWindow::SetSize(int, int) (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737==    by 0x79C6B9C: VisWinRendering::SetSize(int, int) (VisWinRendering.C:1961)
==737==    by 0x7A1734A: VisWindow::SetSize(int, int) (VisWindow.C:1931)
==737==    by 0x4EFA30A: NetworkManager::SetWindowAttributes(EngineVisWinInfo&, WindowAttributes const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (NetworkManager.C:3248)
==737==    by 0x4EF9C93: NetworkManager::SetWindowAttributes(WindowAttributes const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) (NetworkManager.C:3181)
==737==    by 0x4ED24FC: EngineRPCExecutor<SetWinAnnotAttsRPC>::Execute(SetWinAnnotAttsRPC*) (Executors.h:782)
==737==    by 0x4EE5AC2: EngineRPCExecutor<SetWinAnnotAttsRPC>::Update(Subject*) (EngineRPCExecutor.h:33)
==737==    by 0x146A6E5A: Subject::Notify() (Subject.C:159)
==737==    by 0x14566A0D: AttributeSubject::Notify() (AttributeSubject.C:65)
==737==    by 0x146E42A5: Xfer::Process() (Xfer.C:382)
==737==    by 0x4EDC709: Engine::ProcessInput() (Engine.C:1886)
==737==    by 0x4EDC540: Engine::EventLoop() (Engine.C:1830)
==737==    by 0x401C6F: EngineMain(int, char**) (main.C:297)
==737==    by 0x401DB5: main (main.C:360)
==737==  Address 0x2703d680 is 16 bytes after a block of size 96 in arena "client"
==737==

The code in mesa that is causing the problem is in ./src/gallium/state_trackers/osmesa/osmesa.c and here is the snippet of code.

   /*
    * Copy the color buffer from the resource to the user's buffer.
    */
   bpp = util_format_get_blocksize(osbuffer->visual.color_format);
   src = map;
   dst = osbuffer->map;
   if (osmesa->user_row_length)
      dst_stride = bpp * osmesa->user_row_length;
   else
      dst_stride = bpp * osbuffer->width;
   bytes = bpp * res->width0;

   if (osmesa->y_up) {
      /* need to flip image upside down */
      dst = dst + (res->height0 - 1) * dst_stride;
      dst_stride = -dst_stride;
   }

   for (y = 0; y < res->height0; y++) {
      memcpy(dst, src, bytes);
      dst += dst_stride;
      src += transfer->stride;
   }

The problem is that res->height0 and res->width0 are from the size of the osmesa window after the resize, but the buffer is still from the old size. The new height and width are larger than the old size and you get a memcpy outside the buffer.

I found that when you call OSMesaMakeCurrent you pass it the buffer for the image along with the height and width of the window. When VTK makes a call OSMesaMakeCurrent in vtkOSOpenGLRenderWindow as part of the window resize it passes the old buffer before the resize and the sizes after the resize. This is the problem! There will need to be a slight bit of code reordering to avoid this situation.

brugger1 avatar Oct 28 '22 23:10 brugger1

Great find, @brugger1 💪🏻 .

markcmiller86 avatar Oct 28 '22 23:10 markcmiller86

Here is some debugging information that traces the creation of off screen mesa contexts along with some descriptive information. This traces the crash. The first number is the line number in the log file.

//
// Here we create the offscreen window for window 1
//   We set the size and then create it, creating the buffer for the offscreen window
//   and then make it the current context.
//
82   vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=300,height=300
85   vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
95   vtkOSOpenGLRenderWindow::CreateOffScreenWindow enter this=0xc3aef0
97   vtkOSMesaCreateWindow buf=0xd82960,width=300,height=300
100  OSMesaMakeCurrent: osmesa=d82710,stctx=fcde70,buffer=d82960,width=300,height=300

//
// Here we resize the offscreen window for window 1
//   We set the size and destroy the old one.
//
126  vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=731,height=661
129  vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
133  vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
142  OSMesaDestroyContext osmesa=d82710,stctx=fcde70
191  vtkOSMesaDestroyWindow Window=0xd82960

//   Here we create the new offscreen buffer and make it the current context.
198  vtkOSMesaCreateWindow buf=0x2aaaca2be010,width=731,height=661
201  OSMesaMakeCurrent: osmesa=10bfd30,stctx=e947d0,buffer=ca2be010,width=731,height=661

//
// Here we create the offscreen window for window 2
// 
267  vtkOSOpenGLRenderWindow::SetSize enter this=0xb24000,width=300,height=300
270  vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xb24000
280  vtkOSOpenGLRenderWindow::CreateOffScreenWindow enter this=0xb24000
282  vtkOSMesaCreateWindow buf=0x4663750,width=300,height=300
285  OSMesaMakeCurrent: osmesa=4621640,stctx=49f8990,buffer=4663750,width=300,height=300

//
// Here we resize the offscreen window for window 2
//   We set the size and destroy the old one.
//
315  vtkOSOpenGLRenderWindow::SetSize enter this=0xb24000,width=731,height=661
318  vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xb24000
322  vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
331  OSMesaDestroyContext osmesa=4621640,stctx=49f8990
380  vtkOSMesaDestroyWindow Window=0x4663750

//   Here we create the new offscreen buffer and make it the current context.
387  vtkOSMesaCreateWindow buf=0x4663750,width=731,height=661
390  OSMesaMakeCurrent: osmesa=4621640,stctx=48421a0,buffer=4663750,width=731,height=661

//
// Here we resize the offscreen window for window 2 for saving the image.
//   we set the size and destroy the old one.
//
472  vtkOSOpenGLRenderWindow::SetSize enter this=0xb24000,width=1024,height=925
475  vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xb24000
479  vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
919  OSMesaDestroyContext osmesa=4621640,stctx=48421a0
979  vtkOSMesaDestroyWindow Window=0x4663750

//   Here we create the new offscreen buffer and make it the current context.
986  vtkOSMesaCreateWindow buf=0x4ae8d10,width=1024,height=925
989  OSMesaMakeCurrent: osmesa=621fce0,stctx=4866510,buffer=4ae8d10,width=1024,height=925

//
// Here we resize the offscreen window for window 1 for saving the image.
//   we set the size and crash making the context the window 1 context with the new
//   width and height (incorrect). This results in the flushing of the front buffer
//   for the window 2 context with the buffer from window 1 with the new width and
//   height causing a memory overwrite.
//
1509 vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=1024,height=925
1512 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
1515 vtkOSOpenGLRenderWindow::DestroyOffScreenWindow enter this=0xc3aef0
1516 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
1518 OSMesaMakeCurrent: osmesa=10bfd30,stctx=e947d0,buffer=ca2be010,width=1024,height=925
1519 osmesa_st_framebuffer_flush_front stctx=4866510, stfbi=934b20
##### ososbuffer->map=ca2be010
##### dst=ca65a010,src=6a36780,bytes=4096,height=925
##### dst_stride=-4096, transfer->stride=4096

1994 Running: engine_ser -dv -host 127.0.0.1 -port 5601

This traces the creation of the context that causes the crash. This is with the fix and doesn't crash. All the trace information is similar up to that point.

//
// Here we resize the offscreen window for window 1 for saving the image.
//   we set the size and make the contex the window 1 context with the old width and
//   height (correct). This results in the flushing of the front buffer for the
//   window 2 context with the buffer from window 2. Note that in the above case
//   the flushing is with the window 1 context and window 1 buffer.
//
1491 vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=1024,height=925
1494 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
1497 vtkOSOpenGLRenderWindow::DestroyOffScreenWindow enter this=0xc3aef0
1498 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
1500 OSMesaMakeCurrent: osmesa=10bfd30,stctx=e947d0,buffer=ca2be010,width=731,height=661
1501 osmesa_st_framebuffer_flush_front stctx=4866510, stfbi=5090e00
##### ososbuffer->map=4ae8d10
##### dst=4e84d10,src=cb49e040,bytes=4096,height=925
##### dst_stride=-4096, transfer->stride=4096

brugger1 avatar Oct 31 '22 23:10 brugger1

Thats a mouthfull (or screenfull I should say) to try to decipher but am wondering about that negative stride, dst_stride=-4096.

markcmiller86 avatar Oct 31 '22 23:10 markcmiller86

Yes, the negative stride is correct. It is filling the buffer in reverse order line by line.

brugger1 avatar Oct 31 '22 23:10 brugger1