Non-screen capture save window in client/server crashes engine
Describe the bug
This is happening frequently for a VIP LLNL user.
Am attaching the whole set of debug logs here but also writing excerpt highly suggestive the issue is related to IceT because the failure case seems to never return from ConvertIceTImageToAVTImage (or never gets into it).
Excerpt from A.engine_par.000.5.vlog for a WORKING save...
Executing SetWinAnnotAttsRPC 2048x2048
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing RenderRPC for the following plots
0, 1, 2,
NetworkManager::RenderSetup: annotMode=2
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0000".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0001".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0002".
GetSize: 2048, 2048
0: viewport: x=195, y=151, w=1853, h=1453
ConvertIceTImageToAVTImage: w=2048, h=2048, keepZ=0, keepA=0
ConvertIceTImageToAVTImage: IceTImage: rgba_ubyte=1, z=0
ConvertIceTImageToAVTImage: Copying image data.
ConvertIceTImageToAVTImage: Not reading back zbuffer data
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Engine::GatherData:
writer->MustMergeParallelStreams()=false
useCompression=false
respondWithNull=false
scalableThreshold=-1
currentTotalGlobalCellCount=0
cellCountMultiplier=1
exceeded scalable threshold of -1
sending 12583985 bytes to the viewer 73 from strings.
Number of actual direct writes = 2
Excerpt from A.engine_par.000.5.vlog for a FAILED save...
Executing SetWinAnnotAttsRPC 8192x8192
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Mesh_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: FilledBoundary_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing PrepareUpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing UpdatePlotAttsRPC: Subset_1.0
Resetting idle timeout to 480 minutes.
Resetting idle timeout to 480 minutes.
Resetting execution timeout to 30 minutes.
Executing RenderRPC for the following plots
0, 1, 2,
NetworkManager::RenderSetup: annotMode=2
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
Found matching annotation attributes for legend Plot0000
Found matching annotation attributes for legend Plot0001
Found matching annotation attributes for legend Plot0002
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0000".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0001".
VisWinAnnotations::AddAnnotationObject: New Legend object created. It is called "Plot0002".
GetSize: 8192, 8192
0: viewport: x=779, y=603, w=7413, h=5813
signalhandler_core: SIGBRT!
To Reproduce
Steps to reproduce the behavior. For example:
- Run client server from macOS or Linux (I haven't tried Windows) to Pascal
- Open
/usr/gapps/visit/data/multi_ucd3d.silo - Select
parallel pviswith 2 nodes and 36 processors - Put up Mesh Plot, Filled Boundary Plot and Subset plot of domains
- Turn of material 1 and draw
- Now try a flurry of different saves from the Set save options window adjusting either width, which aspect ratio, whether to use screen capture or not. I suspect it will fail only in non-screen-capture mode. In my tests, often tried really, really large widths like 8192 and even 16384 (which saves all black except for legen) and although those large saves did indeed work, the engine would later crash.
Expected behavior
Engine should save a window and not crash.
Attachments
Desktop
- OS and version: macOS and Linux
- VisIt Version: 3.2.2 in both cases
I went to the VIPs office and she showed me a consistent reproducer. I've been able to reproduce with multi_ucd3d.silo running with 36 cores on a node of quartz. This is with VisIt 3.2.2.
- Start VisIt (
visit -v 3.2.2 -debug 5) - Go to a 2x2 window setup
- Set scalable rendering to always
- Open
/usr/gapps/visit/data/multi_ucd3d.silo - Select a 36 node engine in
pdebug - Plot a filled boundary plot
- Go to window 2
- Click on Draw to draw the same filled boundary plot in the second window
- Click on Save window (this saves roughly a 1024x1024 image, the width is 1024, but the height isn't since the window isn't quite square)
- Got to window 1
- Click on Save window
The engine crashes.
A secondary bug, is that if you restart a new engine, it now fails to save the image because of a pipeline usage error. You actually have to clear the window, draw the plots again and then save the image.
I have attached a log file from processor 0.
Looking at the log file, for the successful save, the save starts with
Executing SetWinAnnotAttsRPC 1024x854
Found matching annotation attributes for legend Plot0001
For the crash with the second save we get
Executing SetWinAnnotAttsRPC 1024x854
signalhandler_core: SIGSEGV!
I suspect the crash may be happening just before the message
Found matching annotation attributes for legend Plot0001
is output.
The failure happens in exactly the same manner with 3.3.1. So we can probably debug this with the 3.3RC branch.
Interesting! We have had recent conversations where we have a bit of a funky paradigm for matching annotations with plots. So, maybe that design or associated logic is the culprit.
Interesting thought, now that you mention it.
It's crashing in the function
void
NetworkManager::SetWindowAttributes(EngineVisWinInfo &viswinInfo,
const WindowAttributes &atts, const std::string& extstr,
const double *vexts, const std::string& ctName)
in the file ./engine/main/NetworkManager.C in the code block starting at line 3233.
Hmmm. On both develop and 3.3RC that ref takes me here...
https://github.com/visit-dav/visit/blob/56636a80fd5acc2620285d2d3e66eda6b3d76c99/src/engine/main/NetworkManager.C#L3233-L3243
Is that where you mean?
Off hand, I don't see much in there that could fail except a bad viswin pointer.
That's the correct block of code. I agree, I don't see much that can go wrong there. I'll post more info when I find the exact cause.
After experimenting more with it I found that it also crashes when using a 1x2 window layout with a serial engine. That's a lot easier to debug.
I looked into the how the memory that caused the engine to crash when it was freed was used and couldn't find anything wrong, so I concluded that it was some memory corruption that occurred before the free and had nothing to do with the actual memory being freed.
I then ran valgrind on the engine and got the following output, which points to a memory overwrite down in mesa. Here is the output.
==737== Invalid write of size 8
==737== at 0x4C2E8A3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1033)
==737== by 0x19070E1D: osmesa_st_framebuffer_flush_front (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737== by 0x18E38E25: st_manager_flush_frontbuffer (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737== by 0x18C69737: _mesa_make_current (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737== by 0x18E38BEC: st_api_make_current (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737== by 0x190708D4: OSMesaMakeCurrent (in /usr/WS1/visit/visit/buildvisit/third_party/mesagl/17.3.9/linux-x86_64_gcc-7.3/lib/libOSMesa.so.8.0.0)
==737== by 0x8952B32: vtkOSOpenGLRenderWindow::MakeCurrent() (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737== by 0x88C47CC: vtkOpenGLResourceFreeCallback<vtkOpenGLPolyDataMapper>::Release() (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737== by 0x88DD239: vtkOpenGLRenderWindow::ReleaseGraphicsResources(vtkRenderWindow*) (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737== by 0x8951D81: vtkOSOpenGLRenderWindow::DestroyOffScreenWindow() (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737== by 0x895263D: vtkOSOpenGLRenderWindow::ResizeOffScreenWindow(int, int) (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737== by 0x8952E0F: vtkOSOpenGLRenderWindow::SetSize(int, int) (in /usr/WS1/visit/visit/buildvisit/third_party/vtk/8.1.0/linux-x86_64_gcc-7.3/lib/libvtkRenderingOpenGL2-8.1.so.1)
==737== by 0x79C6B9C: VisWinRendering::SetSize(int, int) (VisWinRendering.C:1961)
==737== by 0x7A1734A: VisWindow::SetSize(int, int) (VisWindow.C:1931)
==737== by 0x4EFA30A: NetworkManager::SetWindowAttributes(EngineVisWinInfo&, WindowAttributes const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (NetworkManager.C:3248)
==737== by 0x4EF9C93: NetworkManager::SetWindowAttributes(WindowAttributes const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) (NetworkManager.C:3181)
==737== by 0x4ED24FC: EngineRPCExecutor<SetWinAnnotAttsRPC>::Execute(SetWinAnnotAttsRPC*) (Executors.h:782)
==737== by 0x4EE5AC2: EngineRPCExecutor<SetWinAnnotAttsRPC>::Update(Subject*) (EngineRPCExecutor.h:33)
==737== by 0x146A6E5A: Subject::Notify() (Subject.C:159)
==737== by 0x14566A0D: AttributeSubject::Notify() (AttributeSubject.C:65)
==737== by 0x146E42A5: Xfer::Process() (Xfer.C:382)
==737== by 0x4EDC709: Engine::ProcessInput() (Engine.C:1886)
==737== by 0x4EDC540: Engine::EventLoop() (Engine.C:1830)
==737== by 0x401C6F: EngineMain(int, char**) (main.C:297)
==737== by 0x401DB5: main (main.C:360)
==737== Address 0x2703d680 is 16 bytes after a block of size 96 in arena "client"
==737==
The code in mesa that is causing the problem is in ./src/gallium/state_trackers/osmesa/osmesa.c and here is the snippet of code.
/*
* Copy the color buffer from the resource to the user's buffer.
*/
bpp = util_format_get_blocksize(osbuffer->visual.color_format);
src = map;
dst = osbuffer->map;
if (osmesa->user_row_length)
dst_stride = bpp * osmesa->user_row_length;
else
dst_stride = bpp * osbuffer->width;
bytes = bpp * res->width0;
if (osmesa->y_up) {
/* need to flip image upside down */
dst = dst + (res->height0 - 1) * dst_stride;
dst_stride = -dst_stride;
}
for (y = 0; y < res->height0; y++) {
memcpy(dst, src, bytes);
dst += dst_stride;
src += transfer->stride;
}
The problem is that res->height0 and res->width0 are from the size of the osmesa window after the resize, but the buffer is still from the old size. The new height and width are larger than the old size and you get a memcpy outside the buffer.
I found that when you call OSMesaMakeCurrent you pass it the buffer for the image along with the height and width of the window. When VTK makes a call OSMesaMakeCurrent in vtkOSOpenGLRenderWindow as part of the window resize it passes the old buffer before the resize and the sizes after the resize. This is the problem! There will need to be a slight bit of code reordering to avoid this situation.
Great find, @brugger1 💪🏻 .
Here is some debugging information that traces the creation of off screen mesa contexts along with some descriptive information. This traces the crash. The first number is the line number in the log file.
//
// Here we create the offscreen window for window 1
// We set the size and then create it, creating the buffer for the offscreen window
// and then make it the current context.
//
82 vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=300,height=300
85 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
95 vtkOSOpenGLRenderWindow::CreateOffScreenWindow enter this=0xc3aef0
97 vtkOSMesaCreateWindow buf=0xd82960,width=300,height=300
100 OSMesaMakeCurrent: osmesa=d82710,stctx=fcde70,buffer=d82960,width=300,height=300
//
// Here we resize the offscreen window for window 1
// We set the size and destroy the old one.
//
126 vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=731,height=661
129 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
133 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
142 OSMesaDestroyContext osmesa=d82710,stctx=fcde70
191 vtkOSMesaDestroyWindow Window=0xd82960
// Here we create the new offscreen buffer and make it the current context.
198 vtkOSMesaCreateWindow buf=0x2aaaca2be010,width=731,height=661
201 OSMesaMakeCurrent: osmesa=10bfd30,stctx=e947d0,buffer=ca2be010,width=731,height=661
//
// Here we create the offscreen window for window 2
//
267 vtkOSOpenGLRenderWindow::SetSize enter this=0xb24000,width=300,height=300
270 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xb24000
280 vtkOSOpenGLRenderWindow::CreateOffScreenWindow enter this=0xb24000
282 vtkOSMesaCreateWindow buf=0x4663750,width=300,height=300
285 OSMesaMakeCurrent: osmesa=4621640,stctx=49f8990,buffer=4663750,width=300,height=300
//
// Here we resize the offscreen window for window 2
// We set the size and destroy the old one.
//
315 vtkOSOpenGLRenderWindow::SetSize enter this=0xb24000,width=731,height=661
318 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xb24000
322 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
331 OSMesaDestroyContext osmesa=4621640,stctx=49f8990
380 vtkOSMesaDestroyWindow Window=0x4663750
// Here we create the new offscreen buffer and make it the current context.
387 vtkOSMesaCreateWindow buf=0x4663750,width=731,height=661
390 OSMesaMakeCurrent: osmesa=4621640,stctx=48421a0,buffer=4663750,width=731,height=661
//
// Here we resize the offscreen window for window 2 for saving the image.
// we set the size and destroy the old one.
//
472 vtkOSOpenGLRenderWindow::SetSize enter this=0xb24000,width=1024,height=925
475 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xb24000
479 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
919 OSMesaDestroyContext osmesa=4621640,stctx=48421a0
979 vtkOSMesaDestroyWindow Window=0x4663750
// Here we create the new offscreen buffer and make it the current context.
986 vtkOSMesaCreateWindow buf=0x4ae8d10,width=1024,height=925
989 OSMesaMakeCurrent: osmesa=621fce0,stctx=4866510,buffer=4ae8d10,width=1024,height=925
//
// Here we resize the offscreen window for window 1 for saving the image.
// we set the size and crash making the context the window 1 context with the new
// width and height (incorrect). This results in the flushing of the front buffer
// for the window 2 context with the buffer from window 1 with the new width and
// height causing a memory overwrite.
//
1509 vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=1024,height=925
1512 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
1515 vtkOSOpenGLRenderWindow::DestroyOffScreenWindow enter this=0xc3aef0
1516 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
1518 OSMesaMakeCurrent: osmesa=10bfd30,stctx=e947d0,buffer=ca2be010,width=1024,height=925
1519 osmesa_st_framebuffer_flush_front stctx=4866510, stfbi=934b20
##### ososbuffer->map=ca2be010
##### dst=ca65a010,src=6a36780,bytes=4096,height=925
##### dst_stride=-4096, transfer->stride=4096
1994 Running: engine_ser -dv -host 127.0.0.1 -port 5601
This traces the creation of the context that causes the crash. This is with the fix and doesn't crash. All the trace information is similar up to that point.
//
// Here we resize the offscreen window for window 1 for saving the image.
// we set the size and make the contex the window 1 context with the old width and
// height (correct). This results in the flushing of the front buffer for the
// window 2 context with the buffer from window 2. Note that in the above case
// the flushing is with the window 1 context and window 1 buffer.
//
1491 vtkOSOpenGLRenderWindow::SetSize enter this=0xc3aef0,width=1024,height=925
1494 vtkOSOpenGLRenderWindow::ResizeOffScreenWindow enter this=0xc3aef0
1497 vtkOSOpenGLRenderWindow::DestroyOffScreenWindow enter this=0xc3aef0
1498 vtkOpenGLRenderWindow::ReleaseGraphicsResources enter
1500 OSMesaMakeCurrent: osmesa=10bfd30,stctx=e947d0,buffer=ca2be010,width=731,height=661
1501 osmesa_st_framebuffer_flush_front stctx=4866510, stfbi=5090e00
##### ososbuffer->map=4ae8d10
##### dst=4e84d10,src=cb49e040,bytes=4096,height=925
##### dst_stride=-4096, transfer->stride=4096
Thats a mouthfull (or screenfull I should say) to try to decipher but am wondering about that negative stride, dst_stride=-4096.
Yes, the negative stride is correct. It is filling the buffer in reverse order line by line.