daru: Use of broadcasting data structures for daru internals.

Currently, daru uses Ruby Arrays for storing data inside Vectors, which are in turn stored collectively inside a DataFrame.

However, this approach reduces the speed of most mathematical operations due to everything being a Ruby object and all the looping operations happening in Ruby.

I would like to explore re-implementing daru's internal data structures in something like NMatrix or Numo::NArray for more efficient storage of data. @genya0407's sake gem does this to some extent, but it is still not as widespread as pandas.

This will most probably make use of broadcasting data structures. In the interest of speed, do you all think it would be alright to sacrifice compatibility with JRuby? Since NMatrix has a Java backend, how about implementing broadcasting in NMatrix and rewriting daru's internals using NMatrix?

Please pitch in with your ideas in this thread.

CC: @mrkn @zverok @genya0407 @gnilrets @lokeshh @kozo2

Mar 31 '17 10:03 v0dro

Could you clarify what is meant by "broadcasting data structures"?

I've recently been playing around a bit with PyArrow and so far it seems like a performant internal data structure.

https://pyarrow.readthedocs.io/en/latest/
https://www.slideshare.net/wesm/python-data-wrangling-preparing-for-the-future
https://github.com/SciRuby/daru/issues/164

If it were integrated into Daru, it might open up Daru to more of the Apache "Big Data" ecosystem, which would be nice. But I have no experience integrating C with Ruby, so I don't really know what kind of effort this would require.

Mar 31 '17 17:03 gnilrets

Broadcasting would basically involve changing the internal data structures in such a way that they are more efficient and reduce copying of data whenever possible. For example, pandas uses numpy internally, which supports broadcasting and hence makes pandas fast.

This will mainly involve choosing an appropriate matrix library like Numo::NArray or NMatrix and integrating them with daru. Some changes might be required in the matrix libraries to fully support broadcasting.
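To make the term concrete: in the NumPy sense, broadcasting lets an operation combine arrays of different shapes by reusing the smaller one across the larger one, without first copying it out to the full shape. The plain-Ruby sketch below only illustrates those semantics. `broadcast_add` is a hypothetical helper written for this example and is not part of daru, NMatrix, or Numo::NArray; a native library would run the equivalent loop in compiled code over typed buffers.

```ruby
# Plain-Ruby illustration of broadcasting semantics.
# `broadcast_add` is a hypothetical helper, not an API of daru,
# NMatrix, or Numo::NArray.
#
# Adds a length-m row to every row of an n x m matrix without first
# building an n x m copy of that row.
def broadcast_add(matrix, row)
  matrix.map do |matrix_row|
    unless matrix_row.size == row.size
      raise ArgumentError, "shape mismatch: #{matrix_row.size} vs #{row.size}"
    end
    # The same `row` object is reused for every row of the matrix.
    matrix_row.zip(row).map { |a, b| a + b }
  end
end

matrix = [[1, 2, 3], [4, 5, 6]]
row    = [10, 20, 30]

result = broadcast_add(matrix, row)
p result
```

Note that this version still allocates Ruby objects for every element and loops in Ruby, which is exactly the overhead that a typed-array backend is meant to avoid.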

Apr 01 '17 10:04 v0dro

FYI:

Apache Arrow is implementing a Tensor object:

https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning#23
https://github.com/apache/arrow/pull/438

Pandas will use it in 2.0:

https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning#25
https://github.com/apache/arrow/pull/477

I'm working on Ruby bindings for Apache Arrow. They are already partially included in Apache Arrow:

https://github.com/apache/arrow/tree/master/c_glib
https://github.com/kou/red-arrow

Apr 03 '17 00:04 kou

@kou do you think we should leapfrog to using Apache Arrow Tensor directly for internal storage? I am seriously considering an overhaul of the daru storage infrastructure given the speed bottlenecks caused by creation of Ruby objects.

If it can be done in a transparent and dependency-free manner, can you please elaborate on how we could proceed with implementing this in daru?

Aug 28 '17 07:08 v0dro

@mrkn if you have experience with arrow can you please shed some light on this?

Aug 28 '17 07:08 v0dro

daru can use NMatrix or Numo::NArray as its internal storage object. The Red Data Tools project provides libraries to convert between them at low cost via Apache Arrow:

Red Arrow NMatrix provides NMatrix#to_arrow and Arrow::Tensor#to_nmatrix.
Red Arrow Numo::NArray provides Numo::*#to_arrow and Arrow::Tensor#to_narray.

You can convert between NMatrix and Numo::NArray as follows:

nmatrix.to_arrow.to_narray # => Numo::*
narray.to_arrow.to_nmatrix # => NMatrix

For now, Apache Arrow focuses on the data format. It doesn't implement data operations yet. They will be implemented after Apache Arrow 1.0.0 is released.

Aug 28 '17 12:08 kou
