Room for improvement with sorting?
Sorting was recently massively improved by @lokeshh (nice work!). However, I wonder if there's still room for improvement. If I take a dataframe, convert it into an array of rows, sort that with `sort_by`, and then convert it back into a dataframe, it's about an order of magnitude faster than `DataFrame#sort`. While much faster, this "brute sort" method scales similarly to `DataFrame#sort` with respect to the number of rows (slightly worse than O(N)). However, "brute sort" doesn't scale as well with respect to the number of columns in the dataframe.
The "brute sort" method would also seem to incur a larger memory / garbage-collection overhead, given the number of intermediate objects that need to be created to make it work. But it's hard to ignore an order-of-magnitude improvement. Do you see any obvious way to improve the current sorting method even more?
```ruby
def brute_sort(df)
  # Extract the columns as plain Ruby arrays.
  df_h = df.to_h
  df_h.each { |k, v| df_h[k] = v.to_a }
  # Build an array of rows from the columns.
  df_a = (0...df.size).map do |idx|
    df_h.map { |_col, values| values[idx] }
  end
  # Sort the rows by the first column, then rebuild the dataframe.
  df_a_sorted = df_a.sort_by { |row| row[0] }
  Daru::DataFrame.new(df_a_sorted.transpose, order: df.vectors.to_a)
end
```
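As an aside, the row-building loop in `brute_sort` can be collapsed into a single `Array#transpose`, which avoids the per-row map over columns. A minimal sketch of the same columns-to-rows-to-columns round trip, modeled on a plain Hash of column name => Array so it runs without daru (the helper name `brute_sort_columns` is mine, not part of daru):

```ruby
# Sketch: same "brute sort" idea, but using Array#transpose to flip
# between column-major and row-major layouts in one call each way.
def brute_sort_columns(columns, key_index = 0)
  rows   = columns.values.map(&:to_a).transpose   # columns -> rows
  sorted = rows.sort_by { |row| row[key_index] }  # sort rows by key column
  Hash[columns.keys.zip(sorted.transpose)]        # rows -> columns
end

cols = { a: [3, 1, 2], b: %w[x y z] }
brute_sort_columns(cols)
# => { a: [1, 2, 3], b: ["y", "z", "x"] }
```

Whether the two extra `transpose` allocations beat the explicit loop would need benchmarking, but it keeps the intermediate-object count predictable.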
```ruby
require 'daru'
require 'benchmark'

base_n = 10_000
0.upto(4) do |iscale|
  n = base_n * 2**iscale
  vector = Daru::Vector.new(n.times.to_a.shuffle)
  df  = Daru::DataFrame.new(a: vector, b: vector, c: vector)
  df2 = Daru::DataFrame.new(a: vector, b: vector, c: vector,
                            d: vector, e: vector, f: vector)

  Benchmark.bm do |bm|
    bm.report("#{n} - sort - df")            { df.sort([:a]) }
    bm.report("#{n} - sort - 2x wide")       { df2.sort([:a]) }
    bm.report("#{n} - brute sort - df")      { brute_sort(df) }
    bm.report("#{n} - brute sort - 2x wide") { brute_sort(df2) }
  end
end
```
```
       user     system      total        real
10000 - sort - df              0.570000   0.000000   0.570000 (  0.569620)
10000 - sort - 2x wide         0.610000   0.010000   0.620000 (  0.621069)
10000 - brute sort - df        0.050000   0.000000   0.050000 (  0.047381)
10000 - brute sort - 2x wide   0.090000   0.000000   0.090000 (  0.089599)
       user     system      total        real
20000 - sort - df              1.310000   0.020000   1.330000 (  1.442690)
20000 - sort - 2x wide         1.330000   0.010000   1.340000 (  1.343521)
20000 - brute sort - df        0.100000   0.000000   0.100000 (  0.097531)
20000 - brute sort - 2x wide   0.160000   0.000000   0.160000 (  0.161511)
       user     system      total        real
40000 - sort - df              2.520000   0.010000   2.530000 (  2.529945)
40000 - sort - 2x wide         2.790000   0.010000   2.800000 (  2.809120)
40000 - brute sort - df        0.190000   0.000000   0.190000 (  0.190723)
40000 - brute sort - 2x wide   0.320000   0.000000   0.320000 (  0.326797)
       user     system      total        real
80000 - sort - df              5.510000   0.020000   5.530000 (  5.579667)
80000 - sort - 2x wide         5.860000   0.020000   5.880000 (  5.892969)
80000 - brute sort - df        0.380000   0.000000   0.380000 (  0.389523)
80000 - brute sort - 2x wide   0.750000   0.010000   0.760000 (  0.762585)
       user     system      total        real
160000 - sort - df            11.520000   0.040000  11.560000 ( 11.606709)
160000 - sort - 2x wide       12.570000   0.030000  12.600000 ( 12.630582)
160000 - brute sort - df       0.920000   0.010000   0.930000 (  0.938696)
160000 - brute sort - 2x wide  1.430000   0.010000   1.440000 (  1.433337)
```
I remember making some trade-offs that reduced the performance of `DataFrame#sort` by constant factors. To mention some of them:

- I had to switch from `sort_by` to `sort`, because `sort_by` didn't work out when the order is reversed. Within `sort` I pass down arrays of values, and comparing arrays is expensive compared to comparing plain values in `sort_by`.
- Further, I had to involve array comparison to handle nils.

All this decreased performance little by little. So, I think we'll have to see whether brute sort keeps its edge once it takes on all the functionality of handling nils, order, etc.
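To make the trade-off concrete, here's a minimal pure-Ruby sketch (not daru's actual implementation) of why a comparator-based `sort` is needed once nils and reversed order enter the picture: `sort_by` computes one key per element and always sorts it ascending, while a `sort` block can special-case nils and flip the comparison:

```ruby
# Rows keyed by their first element; one key is nil.
rows = [[2, "b"], [nil, "c"], [1, "a"]]

# Comparator that sorts nils first; negate cmp for a descending sort.
sorted = rows.sort do |x, y|
  a, b = x[0], y[0]
  if a.nil? && b.nil? then 0
  elsif a.nil?        then -1
  elsif b.nil?        then 1
  else a <=> b
  end
end
# sorted => [[nil, "c"], [1, "a"], [2, "b"]]
```

With plain `sort_by { |row| row[0] }`, the nil key would raise an `ArgumentError` on comparison, which is exactly the case the slower comparator path has to absorb.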
I have a wild idea of porting sorting to C for MRI and Java for JRuby. This will be insanely fast, though we'll need to make significant changes to daru and formalize a lot of things like conventions before moving away from pure Ruby implementations. Pandas does this by using numpy or cython in certain strategic places.
Can sorting be implemented in nmatrix and used in Daru?
Yes that's the plan. But native sorting has not been implemented in nmatrix yet and it will also need to support missing elements.
The issue is open here: https://github.com/SciRuby/nmatrix/issues/454
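Independent of a native backend, the permutation approach that numpy-backed pandas relies on (`argsort`) can be sketched in pure Ruby: compute the sort permutation from the key column once, then apply it to every column with `values_at`. The helper name `argsort_frame` and the Hash-of-arrays frame model are my own illustration, not daru or nmatrix API:

```ruby
# Sort a whole "frame" by one key column via an index permutation.
def argsort_frame(columns, key)
  # Indices of the key column, ordered by the values they point at.
  perm = columns[key].each_index.sort_by { |i| columns[key][i] }
  # Apply the same permutation to every column.
  columns.transform_values { |col| col.values_at(*perm) }
end

cols = { a: [30, 10, 20], b: %w[x y z] }
argsort_frame(cols, :a)
# => { a: [10, 20, 30], b: ["y", "z", "x"] }
```

The appeal is that only the key column participates in comparisons; the other columns are just gathered through the permutation, so width adds gather cost but no comparison cost. A native `argsort` in nmatrix could slot into exactly this shape.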
I looked into the matter recently, and was able to narrow the gap between "brute" and "native" to 4x (for the simplest case: a one-column ascending sort). It was just an experiment, so there's no commit to point at yet :)
But, thinking about it further, I'd say even 10-15 seconds to sort 160k rows is pretty acceptable. It's not the kind of slowness where you grow old waiting for experiment results :)
When I used Daru on real-life cases, there were places where you really couldn't tell whether it was working at all (you wait 5 minutes to obtain results) -- I think those cases should be optimized first.