daru
Use of broadcasting data structures for daru internals.
Currently, daru uses Arrays for storing data inside Vectors, which are collectively stored inside a dataframe.
However, this approach reduces the speed of most mathematical operations due to everything being a Ruby object and all the looping operations happening in Ruby.
I would like to explore alternatives, such as re-implementing daru's internal data structures on top of something like NMatrix or Numo::NArray, for more efficient storage of data. @genya0407's sake gem does this to some extent, but it is still not as widely used as pandas.
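To make the storage point concrete, here is a minimal sketch of the difference between an Array of Ruby objects and a single typed buffer. It uses only core Ruby (`Array#pack` and `String#unpack1`). The class name `PackedDoubleVector` is hypothetical and is not part of daru, NMatrix, or Numo::NArray. It only illustrates the idea of keeping values in one contiguous block of 64-bit floats.

```ruby
# Hypothetical illustration only; this is not daru's API.
# Values live in one binary String instead of an Array of Float objects.
class PackedDoubleVector
  BYTES_PER_DOUBLE = 8

  attr_reader :size

  def initialize(values)
    # 'd*' packs every value as a native-endian 64-bit float.
    @buffer = values.map(&:to_f).pack('d*')
    @size = values.size
  end

  # Read one element by slicing its 8 bytes out of the buffer.
  def [](index)
    raise IndexError, "index #{index} out of bounds" unless (0...@size).cover?(index)

    @buffer.byteslice(index * BYTES_PER_DOUBLE, BYTES_PER_DOUBLE).unpack1('d')
  end

  # Total bytes used by the raw data.
  def bytesize
    @buffer.bytesize
  end

  # Materialize the buffer back into ordinary Ruby Floats.
  def to_a
    @buffer.unpack('d*')
  end
end

vector = PackedDoubleVector.new([1, 2.5, 3])
puts vector[1]
puts vector.bytesize
```

In this sketch the loops still run in Ruby, so it only addresses storage. A native library would also be expected to move the element-wise loops out of Ruby.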
This will most probably make use of broadcasting data structures. In the interest of speed, do you all think it would be alright to sacrifice compatibility with JRuby? Since NMatrix has a Java backend, how about implementing broadcasting in NMatrix and rewriting daru's internals using NMatrix?
Please pitch in with your ideas in this thread.
CC: @mrkn @zverok @genya0407 @gnilrets @lokeshh @kozo2
Could you clarify what is meant by "broadcasting data structures"?
I've recently been playing around a bit with PyArrow and so far it seems like a performant internal data structure.
- https://pyarrow.readthedocs.io/en/latest/
- https://www.slideshare.net/wesm/python-data-wrangling-preparing-for-the-future
- https://github.com/SciRuby/daru/issues/164
If it were integrated into Daru, it might open Daru up to more of the Apache "Big Data" ecosystem, which would be nice. But I have no experience integrating C with Ruby, so I don't really know what kind of effort this would require.
Broadcasting would basically involve changing the internal data structures in such a way that they are more efficient and reduce copying of data whenever possible. For example, pandas uses numpy internally, which supports broadcasting and hence makes pandas fast.
This will mainly involve choosing an appropriate matrix library like Numo::NArray or NMatrix and integrating them with daru. Some changes might be required in the matrix libraries to fully support broadcasting.
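For readers unfamiliar with the term: in NumPy, broadcasting means combining operands of different shapes element-wise, as if the smaller operand were stretched to match the larger one, without actually building the stretched copy. The following is a minimal pure-Ruby sketch of the simplest case, a row added to every row of a matrix. The helper name `broadcast_add` is hypothetical and does not come from daru, NMatrix, or Numo::NArray. Nested Arrays are used here only to show the semantics, not the performance.

```ruby
# Hypothetical helper: add a 1-D row to every row of a 2-D Array of Arrays.
# The row is reused for each matrix row instead of being copied into a
# full-size matrix first.
def broadcast_add(matrix, row)
  matrix.map do |matrix_row|
    unless matrix_row.size == row.size
      raise ArgumentError, "shape mismatch: #{matrix_row.size} vs #{row.size}"
    end

    matrix_row.zip(row).map { |a, b| a + b }
  end
end

matrix = [[1, 2, 3], [4, 5, 6]]
offsets = [10, 20, 30]
result = broadcast_add(matrix, offsets)
p result
```

A matrix library with native broadcasting would be expected to perform the same operation inside its own typed storage, which is where the speed-up discussed above would come from.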
FYI:
Apache Arrow is implementing a Tensor object:
- https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning#23
- https://github.com/apache/arrow/pull/438
Pandas will use it in 2.0:
- https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning#25
- https://github.com/apache/arrow/pull/477
I'm working on Ruby bindings for Apache Arrow. They are already partially included in Apache Arrow:
- https://github.com/apache/arrow/tree/master/c_glib
- https://github.com/kou/red-arrow
@kou do you think we should leapfrog to using Apache Arrow Tensor directly for internal storage? I am seriously considering an overhaul of the daru storage infrastructure given the speed bottlenecks caused by creation of Ruby objects.
If it can be done in a transparent and dependency-free manner, can you please elaborate on how we can proceed with implementing this in daru?
@mrkn if you have experience with arrow can you please shed some light on this?
daru can use NMatrix or Numo::NArray as its internal storage object. The Red Data Tools project provides libraries to convert between them at low cost via Apache Arrow:
- Red Arrow NMatrix provides NMatrix#to_arrow and Arrow::Tensor#to_nmatrix.
- Red Arrow Numo::NArray provides Numo::*#to_arrow and Arrow::Tensor#to_narray.
You can convert between NMatrix and Numo::NArray as follows:
nmatrix.to_arrow.to_narray # => Numo::*
narray.to_arrow.to_nmatrix # => NMatrix
For now, Apache Arrow focuses on the data format; it does not implement data operations yet. They will be implemented after Apache Arrow 1.0.0 is released.