
Room for improvement with sorting?

gnilrets opened this issue · 5 comments

Sorting was recently massively improved by @lokeshh (nice work!). However, I wonder if there's still room for improvement. If I take a dataframe, convert it into an array of rows, sort that with sort_by, and then convert it back into a dataframe, it's about an order of magnitude faster than DataFrame#sort. This "brute sort" method scales similarly to DataFrame#sort with respect to the number of rows (slightly worse than O(N)), but it doesn't scale as well with respect to the number of columns in the dataframe.

The "brute sort" method would also seem to incur a larger memory / garbage-collection overhead, given the number of intermediate objects it creates. But it's hard to ignore an order-of-magnitude improvement. Do you see any obvious way to improve the current sorting method even more?

```ruby
require 'benchmark'
require 'daru'

def brute_sort(df)
  # Pull the dataframe out into a plain Hash of column-name => Array
  df_h = df.to_h
  df_h.each { |k, v| df_h[k] = v.to_a }

  # Build one Array per row
  df_a = (0...df.size).map do |idx|
    df_h.map { |_col, vals| vals[idx] }
  end

  # Sort the rows on the first column, then rebuild the dataframe
  df_a_sorted = df_a.sort_by { |v| v[0] }
  Daru::DataFrame.new(df_a_sorted.transpose, order: df.vectors.to_a)
end
```
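The per-row Array allocation above is what drives the memory overhead I mentioned. A possible variant (just a sketch, operating on a plain Hash of column arrays, the same shape `brute_sort` builds with `df.to_h`; `index_sort` is a hypothetical helper, not daru API) sorts a permutation of row indices instead and reorders each column once:

```ruby
# Sort a Hash of column-name => Array by one column, without
# materializing an Array per row: sort the row indices, then
# reorder every column with a single values_at call.
def index_sort(columns, by)
  key = columns.fetch(by)
  order = key.each_index.sort_by { |i| key[i] }  # permutation of row indices
  columns.transform_values { |vals| vals.values_at(*order) }
end

data = { a: [3, 1, 2], b: [:x, :y, :z] }
index_sort(data, :a)  # => { a: [1, 2, 3], b: [:y, :z, :x] }
```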

```ruby
base_n = 10_000
0.upto(4) do |iscale|
  n = base_n * 2**iscale

  # A shuffled 0...n, shared by every column
  vector = Daru::Vector.new((0...n).to_a.shuffle)

  df  = Daru::DataFrame.new(a: vector, b: vector, c: vector)
  df2 = Daru::DataFrame.new(a: vector, b: vector, c: vector,
                            d: vector, e: vector, f: vector)

  Benchmark.bm do |bm|
    bm.report("#{n} - sort - df")            { df.sort([:a]) }
    bm.report("#{n} - sort - 2x wide")       { df2.sort([:a]) }
    bm.report("#{n} - brute sort - df")      { brute_sort(df) }
    bm.report("#{n} - brute sort - 2x wide") { brute_sort(df2) }
  end
end
```

```
       user     system      total        real
10000 - sort - df  0.570000   0.000000   0.570000 (  0.569620)
10000 - sort - 2x wide  0.610000   0.010000   0.620000 (  0.621069)
10000 - brute sort - df  0.050000   0.000000   0.050000 (  0.047381)
10000 - brute sort - 2x wide  0.090000   0.000000   0.090000 (  0.089599)
       user     system      total        real
20000 - sort - df  1.310000   0.020000   1.330000 (  1.442690)
20000 - sort - 2x wide  1.330000   0.010000   1.340000 (  1.343521)
20000 - brute sort - df  0.100000   0.000000   0.100000 (  0.097531)
20000 - brute sort - 2x wide  0.160000   0.000000   0.160000 (  0.161511)
       user     system      total        real
40000 - sort - df  2.520000   0.010000   2.530000 (  2.529945)
40000 - sort - 2x wide  2.790000   0.010000   2.800000 (  2.809120)
40000 - brute sort - df  0.190000   0.000000   0.190000 (  0.190723)
40000 - brute sort - 2x wide  0.320000   0.000000   0.320000 (  0.326797)
       user     system      total        real
80000 - sort - df  5.510000   0.020000   5.530000 (  5.579667)
80000 - sort - 2x wide  5.860000   0.020000   5.880000 (  5.892969)
80000 - brute sort - df  0.380000   0.000000   0.380000 (  0.389523)
80000 - brute sort - 2x wide  0.750000   0.010000   0.760000 (  0.762585)
       user     system      total        real
160000 - sort - df 11.520000   0.040000  11.560000 ( 11.606709)
160000 - sort - 2x wide 12.570000   0.030000  12.600000 ( 12.630582)
160000 - brute sort - df  0.920000   0.010000   0.930000 (  0.938696)
160000 - brute sort - 2x wide  1.430000   0.010000   1.440000 (  1.433337)
```
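Reading the ratios out of the timings (a quick check on the `real` column, numbers copied from the output above) backs up both claims: DataFrame#sort roughly doubles as N doubles, and brute sort stays 12-15x faster at every size:

```ruby
# `real` times taken from the benchmark output above (3-column dataframe)
sort_real  = { 10_000 => 0.569620, 20_000 => 1.442690, 40_000 => 2.529945,
               80_000 => 5.579667, 160_000 => 11.606709 }
brute_real = { 10_000 => 0.047381, 20_000 => 0.097531, 40_000 => 0.190723,
               80_000 => 0.389523, 160_000 => 0.938696 }

# Speedup of brute sort over DataFrame#sort at each size
speedups = sort_real.map { |n, t| [n, (t / brute_real[n]).round(1)] }.to_h
# => {10000=>12.0, 20000=>14.8, 40000=>13.3, 80000=>14.3, 160000=>12.4}
```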

gnilrets · Apr 16 '16

I remember I made some trade-offs that brought down the performance of DataFrame#sort by constant factors. To mention some of them:

  • I had to switch from sort_by to sort, because sort_by didn't work out when the order is reversed. Within sort I pass down an array of values, and comparing arrays is expensive compared to comparing plain values in sort_by.
  • Further, I had to involve array comparison to handle nils. All of this chipped away at performance little by little.

So I think we'll have to see whether brute_sort keeps its edge once it takes on all the functionality of handling nils, order, etc.
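One way to keep the hot sort on plain keys while still covering nils and reversed order (a sketch of the idea only, not daru's actual implementation; `sort_with_nils` is a hypothetical name, and placing nils first is just one possible convention) is to partition nils out and reverse afterwards:

```ruby
# Split nils off so the sort itself runs on plain values (no array
# comparison needed), then reattach them; a descending sort is just
# a reverse of the ascending result.
def sort_with_nils(values, ascending: true)
  nils, rest = values.partition(&:nil?)
  sorted = rest.sort  # or rest.sort_by { ... } when a key function is needed
  sorted.reverse! unless ascending
  nils + sorted
end

sort_with_nils([3, nil, 1, 2])                   # => [nil, 1, 2, 3]
sort_with_nils([3, nil, 1, 2], ascending: false) # => [nil, 3, 2, 1]
```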

lokeshh · Apr 16 '16

I have a wild idea of porting sorting to C for MRI and Java for JRuby. This will be insanely fast, though we'll need to make significant changes to daru and formalize a lot of things like conventions before moving away from pure Ruby implementations. Pandas does this by using numpy or cython in certain strategic places.

v0dro · Apr 16 '16

Can sorting be implemented in nmatrix and used in Daru?

lokeshh · Apr 17 '16

Yes that's the plan. But native sorting has not been implemented in nmatrix yet and it will also need to support missing elements.

The issue is open here: https://github.com/SciRuby/nmatrix/issues/454

v0dro · Apr 17 '16

I looked into the matter lately and was able to narrow the gap between "brute" and "native" to 4x (for the simplest case, a one-column ascending sort). It was just an experiment, so there's no commit to point at :)

But thinking about it further, I'd say even 10-15 seconds to sort 160k rows is pretty acceptable. It's not the kind of slowness where you grow old waiting for your experiment results :)

When I used Daru on real-life cases, there were places where you really can't tell whether it's working at all (like waiting 5 minutes for results) -- I think those cases should be optimized first.

zverok · May 28 '16