godot icon indicating copy to clipboard operation
godot copied to clipboard

Address Android editor crashes

Open m4gr3d opened this issue 2 years ago • 10 comments
trafficstars

Add workaround for crash when resuming existing project

The crash occurs when trying to open existing vulkan mobile projects on the Godot Android editor, and seems to be caused by a mismatch between the supposed size of the ARRAY_INDEX and its actual size. The change in this PR is a workaround which allows to open the project successfully, but the underlying should be addressed.

Attached is a minimal repro project

HelloGodotVulkan.zip

To repro:

  1. Build a dev version of the Android editor from the master branch and install it on an Android device.
  2. Unzip the repro project, and push it to the Android device
  3. Open the repro project on the Android device using the installed Android editor
  4. Without the change in this draft PR, the editor crashes. With the change, you should see in logcat several errors.

m4gr3d avatar Dec 05 '22 18:12 m4gr3d

Seems fine to me. Since we are pulling the index array from the RenderingServer, this seems like an appropriate place to do error checking.

Would appreciate @kleonc's review as well before merging

@clayjohn Is that an expected behavior then for the ARRAY_INDEX size to be invalid? This workaround is to highlight the issue, but I don't think it should be the full fix. I'm hoping you or @BastiaanOlij can provide some insights around the size mismatch so we can resolve the issue at the root.

m4gr3d avatar Dec 05 '22 19:12 m4gr3d

Seems fine to me. Since we are pulling the index array from the RenderingServer, this seems like an appropriate place to do error checking.

Would appreciate @kleonc's review as well before merging

@clayjohn Note the check is added just for PRIMITIVE_TRIANGLES and the same thing is not checked for PRIMITIVE_TRIANGLE_STRIP (so this case can still potentially crash).

Is that an expected behavior then for the ARRAY_INDEX size to be invalid?

@m4gr3d It's rather the indices within the ARRAY_INDEX being invalid (< 0 or >= length(ARRAY_VERTEX)). The question is why it happened. I think a crash log could point to where the original issue is originated.

kleonc avatar Dec 05 '22 20:12 kleonc

I wonder if the original issue comes from a vulkan issue with reading back vertices from the GPU.

clayjohn avatar Dec 05 '22 20:12 clayjohn

OK this is a really weird one. I agree with @m4gr3d that the check above is a workaround, it doesn't solve the actual problem, just prevents a crash and ensures that for one of the meshes, we're not getting the data we want.

In this scene all meshes are subclasses of PrimitiveMesh, which indeed retrieves the mesh data from the rendering server as it doesn't retain a copy of the data it's sent to the rendering server (unlike say ArrayMesh but similar to many other implementations of Mesh).

surface_get_array_len and surface_get_array_index_len however retrieve cached values stored when the mesh was created so if there is anything that happens on the rendering server that changes the mesh data (say generating LODs which may be applied here and I think is a Vulkan only server implementation, though I can't see evidence of LOD being applied) then that could explain what is happening here. edit I think LOD is handled at import, not in the rendering server, and indeed PrimitiveMesh is not using LOD..

The other explanation (but I would be expecting more issues in that case, and I would expect OpenGL to fail too) is that there is an error in one of the 3 primitive meshes used here where it adds an index to an out of bounds vertex.

I think the first fix here should be to change the surface_get_array_len and surface_get_array_index_len logic to retrieve these values from the rendering server instead of using cached values, so we get values that match up with the data retrieved.

Second, I question whether surface_get_array_len and surface_get_array_index_len should even be used in Mesh::generate_triangle_mesh, the retrieved data from surface_get_arrays already contains this info. The moment Vector<Vector3> vertices = a[ARRAY_VERTEX]; is executed, vertices .size() will give the correct value and it saves a potentially expensive query on the rendering server.

Third, if this is indeed LOD related (or something similar), there may be a potential flaw introduced into generate_triangle_mesh. With LODs the vertex buffer contains all vertices as many LOD levels share vertices with each other, but it then has a separate index buffer for each LOD. It is unclear to me what is in the main index buffer. I'm assuming this is just LOD 0 in which case there is no problem, but it could also be a combined buffer.

I'll do some more testing to figure out which is what :)

BastiaanOlij avatar Dec 05 '22 23:12 BastiaanOlij

Ok, I had to move surface_get_array_len and surface_get_array_index_len into Mesh::_bind_methods so I could access this in GDScript but:

extends Node3D

func _report_mesh(mesh : Mesh):
	var meshdata : Array = mesh.surface_get_arrays(0)
	print("Vertex count: %d (%d reported)" % [ meshdata[Mesh.ARRAY_VERTEX].size(), mesh.surface_get_array_len(0) ])
	print("Index count: %d (%d reported)" % [ meshdata[Mesh.ARRAY_INDEX].size(), mesh.surface_get_array_index_len(0) ])

	var vc : int = meshdata[Mesh.ARRAY_VERTEX].size()
	for index in meshdata[Mesh.ARRAY_INDEX]:
		if int(index) >= vc:
			print("Index out of bound ", index)

# Called when the node enters the scene tree for the first time.
func _ready():
	print("Floor")
	_report_mesh($MeshInstance3D.mesh)
	print("Tree")
	_report_mesh($MeshInstance3D/Tree.mesh)
	print("Tree stem")
	_report_mesh($MeshInstance3D/Tree/MeshInstance3D.mesh)

Gives me:

Floor
Vertex count: 4 (4 reported)
Index count: 6 (6 reported)
Tree
Vertex count: 522 (522 reported)
Index count: 2304 (2304 reported)
Tree stem
Vertex count: 2210 (2210 reported)
Index count: 12672 (12672 reported)

So on desktop at least I'm getting matching counts and all the indices seem to be valid.

I am wondering if this could be a timing thing? Is it still pushing data to the GPU and are we thus reading junk thats currently there?

Or could this be a driver bug in Android reading the buffer data?

BastiaanOlij avatar Dec 05 '22 23:12 BastiaanOlij

We are going to have to debug a little on an actual Android device and see where exactly we are losing data. I came to the same conclusion as you after running it on desktop for a bit, everything appears to work fine on the desktop and there are no obvious places where the desktop and mobile behaviour differ

clayjohn avatar Dec 05 '22 23:12 clayjohn

FWIW, I've been testing on a Pixel 5 device. On that device the issue always reproduce when trying to open the linked repro project in the Android editor.

m4gr3d avatar Dec 06 '22 00:12 m4gr3d

yeah I'm about to run it on my pixel 6, just going through the motions of compiling Android which takes forever on my computer

BastiaanOlij avatar Dec 06 '22 00:12 BastiaanOlij

Output on my Pixel 6:

Godot Engine v4.0.beta.custom_build.015dc492d - https://godotengine.org
Vulkan API 1.1.0 - Using Vulkan Device #0: ARM - Mali-G78
 
Floor
Vertex count: 4 (4 reported)
Index count: 6 (6 reported)
Index out of bound 16968
Index out of bound 16968
Tree
Vertex count: 522 (522 reported)
Index count: 2304 (2304 reported)
Index out of bound 16448
Index out of bound 16128
Index out of bound 32767
Index out of bound 32767
Index out of bound 65535
Index out of bound 49151
Index out of bound 48438
Index out of bound 15688
Index out of bound 16448

So yeah, looks pretty clear that it is not getting proper index information from the buffers. I'm going to add in a delay and see if this is a timing issue

BastiaanOlij avatar Dec 06 '22 00:12 BastiaanOlij

Ok, the delay didn't help so this is not a problem with the data being sent to the GPU not completing before we read it.

We did find an error in RenderingDeviceVulkan::buffer_get_data that our call to _buffer_memory_barrier has it's flags mixed up so that will need to be fixed. That said, that barrier would cause us to wait before writing is finished and we've already ruled that out.

My best guess is that:

vkCmdCopyBuffer(command_buffer, buffer->buffer, tmp_buffer.buffer, 1, &region); // Dst buffer is in CPU, but I wonder if src buffer needs a barrier for this.
// Flush everything so memory can be safely mapped.
_flush(true);

Actually doesn't wait for the copy to CPU memory to be finished and that copying the data into our result buffer thus fails as the data isn't available yet.

We may need another barrier after the copy buffer command has been issued?

BastiaanOlij avatar Dec 06 '22 01:12 BastiaanOlij

@BastiaanOlij Should we submit this workaround for the time being since the proper fix requires further investigation? It'll at least prevent certain vulkan projects from crashing on start.

m4gr3d avatar Jan 22 '23 14:01 m4gr3d

Thanks!

akien-mga avatar Jan 22 '23 16:01 akien-mga

@m4gr3d already merged but yes as a workaround... but with the footnote that basically this makes the functionality defunct.

We really need to find the proper cause of this and fix it because now you have very unreliable code that sometimes results in incomplete data.

We are completely dependent on whether the copy from GPU to CPU happens before the data is consumed because the barrier that is supposed to cause the code to wait for completion is not working on Android.

BastiaanOlij avatar Jan 23 '23 02:01 BastiaanOlij

On another foot note, worth discussing in a wider group. When we sent the data to the rendering server we submit it to the GPU without keeping a copy on the CPU. This is a sensible approach because it is just a waste of memory to store it when 99% of the time, the data is never accessed again.

But when it is expected to be queried, like it is in the editor, then the current approach is very wasteful, pulling the data out of the GPU is costly.

I wonder if it makes sense to add an option to the rendering server which informs it we should keep the data on the CPU side and just restore the stored data when requested. Higher memory usage but potentially worth it when the data is regularly accessed.

BastiaanOlij avatar Jan 23 '23 03:01 BastiaanOlij

Ok, the delay didn't help so this is not a problem with the data being sent to the GPU not completing before we read it.

We did find an error in RenderingDeviceVulkan::buffer_get_data that our call to _buffer_memory_barrier has it's flags mixed up so that will need to be fixed. That said, that barrier would cause us to wait before writing is finished and we've already ruled that out.

My best guess is that:

vkCmdCopyBuffer(command_buffer, buffer->buffer, tmp_buffer.buffer, 1, &region); // Dst buffer is in CPU, but I wonder if src buffer needs a barrier for this.
// Flush everything so memory can be safely mapped.
_flush(true);

Actually doesn't wait for the copy to CPU memory to be finished and that copying the data into our result buffer thus fails as the data isn't available yet.

We may need another barrier after the copy buffer command has been issued?

Is it possible we are dealing with non-coherent memory?

VK_MEMORY_PROPERTY_HOST_COHERENT_BIT flag occurs on memory types that are also HOST_VISIBLE and means that writes/read to this memory on the CPU are made coherent automatically. Without this flag, you need to call vkFlushMappedMemoryRanges after writing and vkInvalidateMappedMemoryRanges before reading the memory via CPU pointer, before/after you use it on the GPU, to make sure caches are flushed/invalidated automatically. Note that mapping/unmapping memory doesn’t play a role here and is not even necessary – you can leave your memory persistently mapped while used on the GPU, as long as you ensure proper synchronization e.g. using VkFence. The reason I didn’t talk about this flag is that all HOST_VISIBLE memory types on all GPUs I’ve ever seen on Windows PC have HOST_COHERENT flag also set. This may change in the future, so a fully robust application should watch out for memory types without it and flush/invalidate accordingly, but for now, non-HOST_COHERENT memory types are a thing on mobile GPUs. Same with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT – a flag that can be found only on mobile chips currently and not on the PC.

https://asawicki.info/news_1740_vulkan_memory_types_on_pc_and_how_to_use_them

Comparing RenderingDeviceVulkan::buffer_get_data with this Khronos read back example, it looks like we are missing a VK_ACCESS_HOST_READ_BIT/VK_PIPELINE_STAGE_HOST_BIT barrier and a vkInvalidateMappedMemoryRanges call.

sakrel avatar May 13 '23 00:05 sakrel