FlatBuffers.jl icon indicating copy to clipboard operation
FlatBuffers.jl copied to clipboard

Building Flatbuffer bytes yields a corrupted array

Open janfrancu opened this issue 6 years ago • 11 comments

When building FlatBuffer bytes, sometimes the array ends up in either of the following states:

  • just showing the array throws segfault in getindex
signal (11): Segmentation fault: 11
in expression starting at no file:0
getindex at ./array.jl:732 [inlined]
alignment at ./arrayshow.jl:68
unknown function (ip: 0x12063304e)
jl_fptr_trampoline at /Users/osx/buildbot/slave/package_osx64/build/src/gf.c:1829
print_matrix at ./arrayshow.jl:186
print_matrix at ./arrayshow.jl:159 [inlined]
print_array at ./arrayshow.jl:308 [inlined]
show at ./arrayshow.jl:345
...
  • showing the array or accessing indexes at the end of the array yields
76287316-element Array{UInt8,1}:
Error showing value of type Array{UInt8,1}:
ERROR: ReadOnlyMemoryError()
Stacktrace:
 [1] getindex at ./array.jl:732 [inlined]
 [2] alignment(::IOContext{REPL.Terminals.TTYTerminal}, ::Array{UInt8,1}, ::Array{Int64,1}, ::UnitRange{Int64}, ::Int64, ::Int64, ::Int64) at ./arrayshow.jl:68
 [3] print_matrix(::IOContext{REPL.Terminals.TTYTerminal}, ::Array{UInt8,1}, ::String, ::String, ::String, ::String, ::String, ::String, ::Int64, ::Int64) at ./arrayshow.jl:186
 [4] print_matrix at ./arrayshow.jl:159 [inlined]
 [5] print_array at ./arrayshow.jl:308 [inlined]
 [6] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Array{UInt8,1}) at ./arrayshow.jl:345
...
  • no exception is thrown and the array can be read back

I have originally suspected segfault in CodecZlib with libz, however it was only the first thing that has touched the array after building the FlatBuffer bytes and therefore the getindex was called there.

The definition of the FlatBuffers, that I use, is following (I have renamed entries but types and structure are the same)

mutable struct Content
    name::String
    os::Vector{String}
    types::Vector{UInt8}
    ss::Vector{String}
end

mutable struct MyFB
	id::String
	s_id::Vector{Int64}
	b_id::Vector{Int64}
	c::Vector{Content}
end

mutable struct MyFBGrouped
       group::Vector{MyFB}
end

In my code I extract data to create MyFB and after that I group them into the MyFBGrouped to reduce the number of files that are created. I have been seeing a few crashes in building MyFB, however I could not isolate the issue to one specific file, because it can just work on the second try and running it through 1M files multiple times is expensive. I was able to successfully reproduce the behavior with the grouping into MyFBGrouped, where I found 100 MyFBs, for which the segfault is triggered in almost 100% of cases (stacktraces above, were generated by it). Unfortunately I am not able to share them, however I am eager to help debug this issue as it is a really high priority for me.

Julia v0.7.0
FlatBuffers v0.4.0

janfrancu avatar Nov 19 '18 12:11 janfrancu

Could you try on the flatc branch and let me know if the issue is still present?

rjkat avatar Nov 19 '18 20:11 rjkat

I've just tried it out on the flatc branch and it seems to resolve the problem. I will do some further tests once I get to work tomorrow.

janfrancu avatar Nov 19 '18 20:11 janfrancu

After some test, I have two notes about the branch and the issue:

  • Firstly, I have unfortunately found another segfault case, around the same size as last time. The interesting thing I have noticed that if I group files[1:end-1] instead of files[1:end], where files are the MyFB FlatBuffers, it does not crash. This lead me to trying just the last two files[end-2:end], which however does not crash, so it somehow needs the preceding bytes to crash... The same behavior can be reproduced on the last tagged version v0.4.0, where in the first case I have to take 4 files from the back of the array (files[1:end-5]) just to get it working and only one file from the second case. Note that in both of these cases the resulting size of the FlatBuffer bytes drops under 75MB after the last removal, however that might be just a coincidence.
  • Secondly, there is a performance regression in the flatc branch, where reading of MyFB takes 6x longer than on v0.4.0 and building takes 10x longer. This is mainly caused by the increased number of allocations and should be hopefully reproducible with a general test (personally have not yet tested on some dummy data). The table has been generated using one sample of about 2MB in size and using the @btime macro.
                   load one MyFB                build one MyFB
flatc branch     0.6s (4.3M allocations)    0.7s (6.3M allocations)
v0.4.0           0.09 (1.9M allocations)    0.07s (1M allocations)

janfrancu avatar Nov 21 '18 08:11 janfrancu

Hey, thanks for the detailed analysis. It's a good point, and we should probably have some performance tests. The performance issue should be easy to reproduce. As for the segfault issue, give me some time to reproduce on my end and I'll see if I can get to the bottom of it.

rjkat avatar Nov 21 '18 23:11 rjkat

I found and fixed an issue which may be the cause of lots of unnecessary allocations - if it's easy for you to run your performance test again on that branch could you do so and post the results here? Thanks!

rjkat avatar Nov 22 '18 07:11 rjkat

Thanks for the effort, unfortunately there is still a lot of allocations. I have made a trace of the allocations for v0.4.0 and for the new flatc branch, separately for both read and build. Hope you can make something out of it.

v0.4.0.zip flatc.zip

Also there is one thing bugging me. Every time I pull the flatc branch and run it on 0.7.0, there is an error with the type keyword in src/macros.jl. I forked it and made some changes, though maybe the change from type to typ might be more desirable, than the one that I made.

janfrancu avatar Nov 22 '18 09:11 janfrancu

Fair! I haven't tested compatibility with julia 0.7. I'll be sure to make sure it works before merging this pull request.

rjkat avatar Nov 24 '18 01:11 rjkat

This thread is turning into a bit of a discussion about PR #33 rather than discussion of the issue itself. I ran some of my own profiling this time and I think I managed to improve the performance a bit more. However, since this change adds a lot of new functionality (vectors of unions, support for deprecated fields, flatc) some regression in performance is inevitable. The question of how much regression is tolerable is subjective, but if it makes the package completely unusable for you then I’d say that’s unacceptable. So the question is now: would you be happy if the package were released from flatc now?

The follow up question is: as it stands, I can’t reproduce this issue. I tried replicating the types given and populating them with some stuff. Can you provide a minimal reproduction example? If this is not possible, can you give approximate numbers/sizes of each of the fields of the given types?

rjkat avatar Nov 24 '18 23:11 rjkat

I am sorry for the late response, I spent last few weeks working around it using Julia built in serialization, however that was also quite slow to my taste. The issue there was that the structure is probably too nested for the serialization to work faster than the FlatBuffer on either of the branches in question. As for the performance regression I would probably like to see how well the package performs against implementation in other languages, for I understand the urge to bring it closer to the flatc functionality and this might be a good measure to weigh in on the decision.

As for the issue at hand, while benchmarking the different methods I managed to create a reproducible example of a crash, however this time during the reading of the FlatBuffer bytes. Here is the sample script, that populates an array with the same MyFB, builds the FlatBuffer bytes, and then reads them.

using FlatBuffers
using Random

## definition of FlatBuffers from the first post here
mutable struct Content
    name::String
    os::Vector{String}
    types::Vector{UInt8}
    ss::Vector{String}
end

mutable struct MyFB
	id::String
	s_id::Vector{Int64}
	b_id::Vector{Int64}
	c::Vector{Content}
end

mutable struct MyFBGrouped
	group::Vector{MyFB}
end


function createMyFB(n)
	function randContent()
		name = randstring(rand(2:7))
		m = rand(1:3)
		os = [randstring(rand(3:20)) for i in 1:m]
		types = UInt8.(rand(0:10, m))
		ss = [randstring(rand(4:20)) for i in 1:rand(0:2)]
		Content(name, os, types, ss)
	end

	id = randstring(10)
	s_id = rand(1:n, n)
	b_id = rand(1:n, n)
	c = [randContent() for i in 1:n]
	MyFB(id, s_id, b_id, c)
end


myfb = createMyFB(10000);
group = MyFBGrouped(fill(myfb, 90));

bytes = FlatBuffers.bytes(FlatBuffers.build!(group));
length(bytes) # the bigger the file the easier it is to crash
readBack = FlatBuffers.read(MyFBGrouped, bytes);

This crashes Julia every time, both on v0.4.0 and flatc, if not, just repeat the read twice or more times until it does. The crash is one of the least informant I've ever seen

signal (11): Segmentation fault: 11
in expression starting at no file:0
_platform_memmove$VARIANT$Haswell at /usr/lib/system/libsystem_platform.dylib (unknown line)
Allocations: 70860507 (Pool: 70858006; Big: 2501); GC: 161

Additionally sometimes on the flatc branch just before the segfault the reading function starts to return the following error

ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type String
Closest candidates are:
  convert(::Type{String}, ::Symbol) at deprecated.jl:53
  convert(::Type{String}, ::Array{UInt8,1}) at deprecated.jl:53
  convert(::Type{String}, ::AbstractArray{Char,1}) at deprecated.jl:53
  ...
Stacktrace:
 [1] Content(::Nothing, ::Nothing, ::Nothing, ::Nothing) at ./REPL[3]:2
 [2] read(::FlatBuffers.Table{Content}, ::Type{Content}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:319
 [3] getvalue(::FlatBuffers.Table{MyFB}, ::Int64, ::Type{Content}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:262
 [4] getarray(::FlatBuffers.Table{MyFB}, ::Int64, ::Int32, ::Type{Content}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:229
 [5] getvalue(::FlatBuffers.Table{MyFB}, ::Int16, ::Type{Array{Content,1}}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:239
 [6] read(::FlatBuffers.Table{MyFB}, ::Type{MyFB}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:314
 [7] getvalue(::FlatBuffers.Table{MyFBGrouped}, ::Int64, ::Type{MyFB}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:262
 [8] getarray(::FlatBuffers.Table{MyFBGrouped}, ::Int64, ::Int32, ::Type{MyFB}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:229
 [9] getvalue(::FlatBuffers.Table{MyFBGrouped}, ::Int16, ::Type{Array{MyFB,1}}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:239
 [10] read(::FlatBuffers.Table{MyFBGrouped}, ::Type{MyFBGrouped}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:314
 [11] read at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:294 [inlined]
 [12] read(::Type{MyFBGrouped}, ::Array{UInt8,1}) at /Users/jfrancu/.julia/packages/FlatBuffers/KWFpP/src/FlatBuffers.jl:325
 [13] top-level scope at none:0

Why I think this is relevant to the building issue is because if I dump those bytes into a file right after reading and try to deserialize them in another Julia session everything is okay. This leads me to believe, that even though the array access does not trigger segfault, the array still seems to be somehow corrupted after its creation, or there is some internal state that affects the deserialization.

janfrancu avatar Dec 12 '18 13:12 janfrancu

sweet! thanks for the reproduction case. I was able to prevent it from crashing by doing this.

using FlatBuffers
using Random

# rjkat: import julia's garbage collector
import Base.GC

## definition of FlatBuffers from the first post here
mutable struct Content
    name::String
    os::Vector{String}
    types::Vector{UInt8}
    ss::Vector{String}
end

mutable struct MyFB
	id::String
	s_id::Vector{Int64}
	b_id::Vector{Int64}
	c::Vector{Content}
end

mutable struct MyFBGrouped
	group::Vector{MyFB}
end


function createMyFB(n)
	function randContent()
		name = randstring(rand(2:7))
		m = rand(1:3)
		os = [randstring(rand(3:20)) for i in 1:m]
		types = UInt8.(rand(0:10, m))
		ss = [randstring(rand(4:20)) for i in 1:rand(0:2)]
		Content(name, os, types, ss)
	end

	id = randstring(10)
	s_id = rand(1:n, n)
	b_id = rand(1:n, n)
	c = [randContent() for i in 1:n]
	MyFB(id, s_id, b_id, c)
end


myfb = createMyFB(10000);
group = MyFBGrouped(fill(myfb, 90));
bytes = FlatBuffers.bytes(FlatBuffers.build!(group));
length(bytes) # the bigger the file the easier it is to crash
# rjkat: temporarily disable garbage collection
GC.enable(false)
readBack = FlatBuffers.read(MyFBGrouped, bytes);
# rjkat: turn garbage collection back on again
GC.enable(true)

I suspect it is a bug in Julia's garbage collector, or the way we are interacting with it in FlatBuffers. For now use this as a workaround, I'll look at disabling garbage collection as part of the FlatBuffers read() and deserialize() functions in a later release.

rjkat avatar Dec 12 '18 14:12 rjkat

Thanks for the quick response. It is quite a radical workaround disabling the GC, however I can confirm, that it also remedies the main issue, i.e. I am not able to reproduce the corrupted arrays with GC disabled for the whole script, so that's somewhat good news. Hope this can be fixed soon.

janfrancu avatar Dec 12 '18 20:12 janfrancu