
Serialization

alyst opened this issue 9 years ago · 13 comments

Was the performance/storage efficiency of [Nullable]CategoricalArray serialization checked? Would it make sense to override serialize()/deserialize() for CategoricalPool (the invindex and especially valindex fields could be reconstructed from index)?

alyst · Mar 07 '17 12:03

Here's a little benchmark using Julia 0.5.0, NullableArrays 0.1.0, and CategoricalArrays 0.1.2:

using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end

map(filesize, ["arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 3-element Array{Int64,1}:
# 266664515
#  38641606
#  40000939

It looks strange that the categorical array is that big.

alyst · Mar 07 '17 13:03

Here are the deserialization results. Unfortunately, I wasn't able to install BenchmarkTools in an official Julia Docker container, so the actual numbers might differ, but the trend is clear:

@time open("arr.jlser", "r") do io deserialize(io) end
# 58.859399 seconds (211.15 M allocations: 8.504 GB, 21.28% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 82.845217 seconds (211.11 M allocations: 8.502 GB, 43.56% gc time)

@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 39.956336 seconds (111.14 M allocations: 4.936 GB, 55.29% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 23.724424 seconds (111.11 M allocations: 4.935 GB, 25.49% gc time)

@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.225701 seconds (106.56 k allocations: 42.491 MB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.092174 seconds (1.97 k allocations: 38.244 MB)

Array{Nullable{String}} times are bad, but not very surprising. The poor NullableArray{String} performance, however, is sad.

alyst · Mar 07 '17 13:03

Interesting. I would also compare with Array{String}, which is really the natural reference for NullableArray{String} and (Nullable)CategoricalArray{String}.

The filesize for the categorical array (40000939) is exactly what I would expect: 4 bytes per element plus some negligible overhead. It's hard to be more efficient than that. I'm not sure how NullableArray manages to go below that limit: even just storing a pointer to a pooled string would take more space. Are you sure the array is correctly restored after deserialization?
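A quick back-of-the-envelope check of that expectation (a sketch; the exact overhead breakdown is a guess):

```julia
n = 10^7
n * sizeof(UInt32)  # 40_000_000 bytes of UInt32 reference codes
# Observed file size: 40_000_939; the remaining ~900 bytes would be the
# pool levels plus serialization metadata.
```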

As for possible optimizations, we could rebuild invindex and valindex from index and save only index. That would make little difference when the number of levels is small compared with the number of elements, though. Anyway, we will need custom serialization rules if we store level hashes (https://github.com/JuliaData/CategoricalArrays.jl/pull/61), since IIRC hashes may include a random component changing across sessions in the future. If we do that, we may as well implement all possible optimizations.
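A rough sketch of what such custom rules might look like (hypothetical: the serializer signatures, the field names, and a one-argument CategoricalPool constructor that rebuilds the derived fields are all assumed here, not taken from the actual package):

```julia
import Base: serialize, deserialize

# Write only the level index; invindex (level => code lookup) and valindex
# (cached values) are derived data that can be rebuilt on load.
function serialize(s::AbstractSerializer, pool::CategoricalPool)
    Base.Serializer.serialize_type(s, typeof(pool))
    serialize(s, pool.index)
end

function deserialize{T}(s::AbstractSerializer, ::Type{CategoricalPool{T}})
    index = deserialize(s)::Vector{T}
    return CategoricalPool(index)  # assumed to reconstruct invindex/valindex
end
```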

nalimilan · Mar 07 '17 15:03

Ah, right, categorical arrays use UInt32 by default. Could it be worth using a more compact integer type for serialization? I guess NullableArray occupies less space because the serialization of #undef is compact.

Here are the results for Vector{String}:

arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end
map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
#4-element Array{Int64,1}:
#  40000012
# 266663942
#  38640383
#  40000939

@time open("arr2.jlser", "r") do io deserialize(io) end
# 39.064836 seconds (110.00 M allocations: 5.290 GB, 47.89% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end
# 36.619610 seconds (110.00 M allocations: 5.290 GB, 43.79% gc time)

I guess there is room for improvement in Base. I tried to use the serialization mechanism to load/save moderately sized data frames (~5.5 GB gzipped), and loading is terribly slow.

alyst · Mar 07 '17 15:03

Ah, right, categorical arrays use UInt32 by default. Could it be worth using a more compact integer type for serialization?

Yeah, I guess we could use the same logic as in compress to choose the smallest possible type. We would also need to save the original type so that we can restore the same array, though (otherwise adding new levels could fail). Julia probably saves the type of the object somewhere already.
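For illustration, the "smallest possible type" choice could be sketched like this (the function name is made up; reference code 0 being reserved for null mirrors how missing values are encoded):

```julia
# Pick the narrowest unsigned integer type able to hold `nlevels` level codes.
function smallest_reftype(nlevels::Integer)
    nlevels <= typemax(UInt8)  && return UInt8
    nlevels <= typemax(UInt16) && return UInt16
    nlevels <= typemax(UInt32) && return UInt32
    return UInt64
end

smallest_reftype(8)  # UInt8
```

With the 8-level pool above, the references would shrink from 4 bytes to 1 byte per element, i.e. from ~40 MB to ~10 MB before compression.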

I guess NullableArray occupies less space because the serialization of #undef is compact.

There should be only about 11% of nulls in the vector, so that can explain the difference compared with Array{String}, but not the fact that only 4 bytes are used per element. That could indicate that Base does something clever to save duplicated strings, but looking at the code that does not seem to be the case...

Also, it's really weird that deserializing Array{String} is slower (on the second run) than NullableArray{String}...

nalimilan · Mar 07 '17 16:03

Could it be worth using a more compact integer type for serialization?

OTOH, since serialization tries to restore the references, it's better to keep the serialization format as close to the in-memory type as possible. Gzipping should take care of the size automatically:

using GZip  # gzopen comes from the GZip.jl package

gzopen("arr.jlser.gz", "w") do io serialize(io, arr) end
gzopen("arr2.jlser.gz", "w") do io serialize(io, arr2) end
gzopen("null_arr.jlser.gz", "w") do io serialize(io, NullableArray(arr)) end
gzopen("null_cat_arr.jlser.gz", "w") do io serialize(io, categorical(arr)) end

map(filesize, ["arr2.jlser.gz", "arr.jlser.gz", "null_arr.jlser.gz", "null_cat_arr.jlser.gz"])
#4-element Array{Int64,1}:
# 5801136
# 8686152
# 7389933
# 6254812

There should be only about 11% of nulls in the vector, so that can explain the difference compared with Array{String}, but not the fact that only 4 bytes are used per element.

The contents of null_arr.jlser look like this:

E&|B&|B&|E&|A&|G&|A$&|D&|D&|F&|F&|A&|B$$&|D&|F&|B&|F&|E&|F&|E$&|F&|B&|D&|B$&|A&|E&|G&|E&|E&|C&|H&|E&|F$&|C$&|C&|C$&|A&|H&|A&|H&|E$&|H&|E&|C&|C&|H&|C&|C&|H&|G&|B&|D&|G$$&|F&|E&|H&|H&|B&|A&|D&|B&|H&|E&|F&|C&|G&|B&|C&|B&|D&|C$&|B&

In hex:

45 26 15 7C 41 26 15 7C 46 26 15 7C 43 26 15 7C 48 26 15 7C 41 24 26 15 7C 41 26 15 7C 46 26 15 7C 42 26 15 7C 47 26 15 7C 42 26 15 7C 47 24 26 15 7C 48 26 15 7C 46 24 26 15 7C 46 26 15 7C 44 26 15 7C 43 26 15 7C 43 26 15 7C 41 26 15 7C 42 26 15 7C 46 26 15 7C 47 26 15 7C 45 26 15 7C 46 26 15

alyst · Mar 07 '17 16:03

Of course, that's because your strings are only one character long, so 1 byte each, plus apparently 3 bytes of overhead. With longer level names, the story would be quite different!
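A rough size model consistent with that reading of the bytes (a sketch; the per-null cost is a guess):

```julia
# ~1/9 of the samples are null; each non-null costs 1 data byte + 3 tag bytes.
nonnull = round(Int, 10^7 * 8 / 9)   # 8_888_889
nulls   = 10^7 - nonnull             # 1_111_111
nonnull * 4 + nulls * 3              # 38_888_889, near the observed 38_641_606
```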

nalimilan · Mar 07 '17 16:03

Relevant upstream issue: JuliaLang/julia#18633

alyst · Mar 07 '17 17:03

The representation of String changed in 0.6; maybe it improved (de)serialization performance?

nalimilan · Mar 07 '17 17:03

I haven't tested v0.6 yet, but Vector{String} is improved in v0.5.1:

using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end;
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end;
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end;
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end;

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
#4-element Array{Int64,1}:
#  40000012
# 266670491
#  38643340
#  40000939

@time open("arr.jlser", "r") do io deserialize(io) end;
# 83.365617 seconds (127.84 M allocations: 4.879 GB, 7.27% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 94.405233 seconds (127.78 M allocations: 4.876 GB, 16.81% gc time)

@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 16.028068 seconds (57.82 M allocations: 1.758 GB, 39.62% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 12.801466 seconds (57.78 M allocations: 1.756 GB, 26.86% gc time)

@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.271359 seconds (106.03 k allocations: 42.476 MB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.034357 seconds (1.73 k allocations: 38.231 MB)

@time open("arr2.jlser", "r") do io deserialize(io) end;
# 11.650630 seconds (50.00 M allocations: 1.714 GB, 14.06% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end;
# 12.030219 seconds (50.00 M allocations: 1.714 GB, 14.30% gc time)

alyst · Mar 07 '17 23:03

Finally, v0.6, using the official nightly. Everything is improved, although everything except categorical arrays still shows considerable time/memory overhead.

using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end;
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end;
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end;
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end;

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
#4-element Array{Int64,1}:
# 100000012
# 320003354
#  91977889
#  40001083

@time open("arr.jlser", "r") do io deserialize(io) end;
# 21.946035 seconds (89.01 M allocations: 3.474 GiB, 15.27% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 26.444456 seconds (88.89 M allocations: 3.469 GiB, 26.98% gc time)

@time open("null_arr.jlser", "r") do io deserialize(io) end;
#  7.818007 seconds (48.89 M allocations: 1.094 GiB, 35.74% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
#  5.232938 seconds (48.89 M allocations: 1.094 GiB, 13.68% gc time)

@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.148508 seconds (33.76 k allocations: 39.701 MiB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.031035 seconds (2.04 k allocations: 38.254 MiB)

@time open("arr2.jlser", "r") do io deserialize(io) end;
#  5.172547 seconds (40.00 M allocations: 991.886 MiB, 15.52% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end;
#  4.753201 seconds (40.00 M allocations: 991.886 MiB, 12.20% gc time)

alyst · Mar 08 '17 16:03

Much better indeed. It could be useful to post the relevant timings on the Julia issue too.

I'm not sure what other software could be used to compare serialization performance. Maybe Feather and HDF5?

nalimilan · Mar 08 '17 17:03

Also R's load()/save() serialization. It's written in C, so it should be quite fast.

alyst · Mar 08 '17 17:03