Improvements for NMatrix::read and NMatrix#write (e.g., compression, yaml)

Right now, NMatrix uses C++ STL's iostream to handle binary reading and writing of matrices. That's great and stuff, but I'm not sure how to make it compatible (without several days of research) with Ruby's IO module.

Ideally, we should be able to do things like this:

File.open("matrixfile", "wb") do |f|
  f.write(m)
end

or potentially more awkwardly:

File.open("matrixfile", "wb") do |f|
  m.write(f)
end

This would also enable us to run it through a Zlib filter and compress or decompress large matrices on the fly.
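As a rough sketch of what the Zlib filter idea could look like from the Ruby side: the example below wraps a File in Zlib::GzipWriter and Zlib::GzipReader from the standard library. It does not use NMatrix at all; the array of floats packed as little-endian doubles is only a stand-in for a matrix's data, and the helper names write_compressed and read_compressed are made up for illustration.

```ruby
require "zlib"
require "tempfile"

# Stand-in for a matrix's raw storage (NMatrix itself is not used here).
# Elements are packed as little-endian 64-bit floats.
def write_compressed(path, values)
  File.open(path, "wb") do |f|
    gz = Zlib::GzipWriter.new(f)
    gz.write(values.pack("E*"))
    gz.finish # write the gzip trailer; File.open's block closes the file
  end
end

def read_compressed(path)
  File.open(path, "rb") do |f|
    gz = Zlib::GzipReader.new(f)
    data = gz.read
    gz.finish
    data.unpack("E*")
  end
end

values = Array.new(1_000) { |i| i * 0.5 }
Tempfile.create("matrixfile") do |tmp|
  tmp.close
  write_compressed(tmp.path, values)
  restored = read_compressed(tmp.path)
  puts restored == values
end
```

The point is only that once the writer accepts a Ruby IO object, compression becomes a wrapper around that object rather than something the matrix code has to know about.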

Other thoughts:

We could use zlib.h to simply compress within the C++ code. That's the easiest solution, but is the least Ruby-like. It makes some sense because we're writing binary files here, and shouldn't need to stack multiple matrices in one file. On the other hand, it raises the question of what we do to pickle additional options, such as on types that inherit from NMatrix -- where does YAML information get stored?
Ideally, we only want to compress/decompress the data portion of the matrix. At the very least, the NMatrix version information needs to be in plain binary so NMatrix doesn't have to decompress the whole damn file to find out it's incompatible. That would suck with large matrices.
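One possible layout for this, sketched in Ruby: a small uncompressed header (tag, version, shape) followed by a deflated data section, so the version can be checked without inflating anything. The tag "NMSK", the version triple, and the field widths are all invented for this example and are not NMatrix's actual file format.

```ruby
require "zlib"
require "stringio"

MAGIC = "NMSK".b            # hypothetical 4-byte tag, not NMatrix's real format
FORMAT_VERSION = [1, 0, 0]  # hypothetical major, minor, patch

def write_matrix(io, shape, values)
  payload = Zlib::Deflate.deflate(values.pack("E*"))
  io.write(MAGIC)
  io.write(FORMAT_VERSION.pack("v3"))     # plain binary, never compressed
  io.write([shape.length].pack("v"))
  io.write(shape.pack("V*"))
  io.write([payload.bytesize].pack("Q<"))
  io.write(payload)                       # only the data portion is compressed
end

# Reads just the plain header, so an incompatible file is rejected cheaply.
def read_header(io)
  raise ArgumentError, "not a matrix file" unless io.read(4) == MAGIC
  version = io.read(6).unpack("v3")
  unless version[0] == FORMAT_VERSION[0]
    raise ArgumentError, "incompatible version #{version.join('.')}"
  end
  dim = io.read(2).unpack1("v")
  shape = io.read(4 * dim).unpack("V*")
  { version: version, shape: shape }
end

def read_matrix(io)
  header = read_header(io)
  size = io.read(8).unpack1("Q<")
  values = Zlib::Inflate.inflate(io.read(size)).unpack("E*")
  header.merge(values: values)
end

io = StringIO.new("".b)
write_matrix(io, [2, 2], [1.0, 2.0, 3.0, 4.0])
io.rewind
p read_matrix(io)
```

Extra metadata for subclasses (the YAML question above) could in principle sit in another length-prefixed section between the header and the payload, but that is a design choice this sketch leaves open.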

More later, perhaps.

Jul 09 '13 23:07 translunar

Hi. I have thought about this problem. It is possible to use Marshal objects.

Feb 17 '16 16:02 Aelphy

What do you think?

Feb 18 '16 11:02 Aelphy

Can you please provide some more detail?

Feb 18 '16 16:02 translunar

The idea is to convert the NMatrix object to a string using the Marshal module, which supports load and dump operations. The dump operation returns a string that can be written to a file using the following interface:

File.open('matrixfile', 'wb') do |f|
  f.write(Marshal.dump(m))
end

http://ruby-doc.org/core-2.3.0/Marshal.html

Compression could also be added by defining a marshal_dump method on NMatrix.

Feb 18 '16 17:02 Aelphy

Is it written in C, or in pure Ruby? How does it work with C data structures?

Feb 18 '16 21:02 translunar

I am thinking about the following implementation: Marshal will call NMatrix's marshal_dump method. Up to this point it is pure Ruby. As far as I understand, it is a good idea for marshal_dump to call C code that compresses the data and returns a string to the Ruby code. That string is then the return value of marshal_dump and is written to the file through the standard Ruby interface. This combines the advantages of C code with a Ruby interface.
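A minimal sketch of this flow, using a made-up MatrixSketch class rather than NMatrix. Here the packing and compression happen in Ruby via Zlib; in the proposal above that step would live in C, which this example does not attempt. marshal_dump may return any marshalable object, so the sketch returns the shape together with the compressed string.

```ruby
require "zlib"

# Hypothetical stand-in for NMatrix, for illustration only.
class MatrixSketch
  attr_reader :shape, :values

  def initialize(shape, values)
    @shape = shape
    @values = values
  end

  # Marshal.dump calls this and serializes whatever it returns.
  def marshal_dump
    [@shape, Zlib::Deflate.deflate(@values.pack("E*"))]
  end

  # Marshal.load allocates an instance and passes it the dumped value.
  def marshal_load(dumped)
    @shape, compressed = dumped
    @values = Zlib::Inflate.inflate(compressed).unpack("E*")
  end
end

m = MatrixSketch.new([2, 3], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
bytes = Marshal.dump(m)
copy = Marshal.load(bytes)
puts copy.values == m.values
```

Since Marshal.dump returns a binary string, the file it is written to should be opened in binary mode ("wb" and "rb").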

Feb 18 '16 22:02 Aelphy

What do you mean by "compress," exactly?

Feb 19 '16 15:02 translunar

It could be anything, starting with passing the matrix data through Zlib. It is also possible to implement one of the large-matrix compression methods described here: https://peerj.com/preprints/849.pdf .

Feb 19 '16 16:02 Aelphy

In other words, any data processing could be implemented at this step.

Feb 19 '16 16:02 Aelphy

Have you explored the numpy binary file format?

Feb 19 '16 17:02 v0dro

Okay. I like this strategy. Go for it.

Feb 19 '16 18:02 translunar

@v0dro, not yet. Do you think it is a good idea to follow the same format? I understand that it could be useful for loading numpy matrices, but in general I think it is better to implement a separate method for reading numpy matrices.

Feb 19 '16 19:02 Aelphy

@mohawkjohn, ok.

Feb 19 '16 19:02 Aelphy

One more question: can this task be part of a GSOC 2016 proposal?

Mar 16 '16 04:03 Aelphy

Also, I have just checked that defining a marshal_dump method directly in the NMatrix class makes Marshal.dump work correctly. Now I am thinking about where to define this method, because I will need access to the matrix data inside it.

Mar 16 '16 05:03 Aelphy

I have a doubt. If you're going to implement a compression algorithm for storing matrices, how can one seek to a particular element directly and then read a given number of elements from that point onward?

That functionality would be important since very large matrices cannot be stored in memory and often users will want to read off a part of it from persistent storage.
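For reference, this is roughly what such a partial read looks like on an uncompressed file of fixed-size elements. It is a sketch under assumptions: elements are little-endian 64-bit floats stored contiguously, and read_elements and header_bytes are names invented here, not part of NMatrix.

```ruby
require "tempfile"

ELEMENT_SIZE = 8 # bytes per little-endian float64 (assumed storage type)

# Reads `count` elements starting at element index `start`, touching only
# the bytes that are needed. `header_bytes` skips any fixed-size header.
def read_elements(path, start, count, header_bytes: 0)
  File.open(path, "rb") do |f|
    f.seek(header_bytes + start * ELEMENT_SIZE)
    (f.read(count * ELEMENT_SIZE) || "").unpack("E*")
  end
end

Tempfile.create("matrixfile") do |tmp|
  tmp.binmode
  tmp.write(Array.new(100) { |i| i.to_f }.pack("E*"))
  tmp.close
  p read_elements(tmp.path, 10, 3)
end
```

The offset arithmetic only works because every element occupies the same number of bytes on disk, which is exactly what whole-file compression takes away.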

Mar 16 '16 07:03 v0dro

I agree, it is a problem. For now I want to implement the interface first (without compression).

Mar 16 '16 09:03 Aelphy

Does NMatrix support partial reads and writes from disk (the problem you described)? Or are you talking about the problem of working with a compressed matrix on disk without NMatrix?

Mar 16 '16 09:03 Aelphy

I'm concerned about whether your compressed matrix will work with file seeking. Hence I don't think storing it in a compressed file format is a very good idea in the first place.

I do not think partial read/write is supported. You can add that as part of this issue.

Mar 16 '16 10:03 v0dro

I have found an article that describes seekable compression using zlib: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/. I think it could be interesting to try this approach.
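A simplified sketch of the general idea, not the specific formats from the article: compress the data in independent fixed-size chunks and keep an index of where each chunk starts, so that reading a range only inflates the chunks it overlaps. The chunk size and the helper names write_chunked and read_range are assumptions made for this example.

```ruby
require "zlib"
require "stringio"

CHUNK_ELEMENTS = 1024 # elements per independently compressed chunk (arbitrary)

# Writes each chunk as its own deflate stream and returns an index holding
# the byte offset and compressed size of every chunk.
def write_chunked(io, values)
  values.each_slice(CHUNK_ELEMENTS).map do |slice|
    compressed = Zlib::Deflate.deflate(slice.pack("E*"))
    entry = { offset: io.pos, size: compressed.bytesize }
    io.write(compressed)
    entry
  end
end

# Reads `count` elements starting at `start`, inflating only the chunks
# that overlap the requested range.
def read_range(io, index, start, count)
  return [] if count <= 0
  first = start / CHUNK_ELEMENTS
  last = (start + count - 1) / CHUNK_ELEMENTS
  values = (first..last).flat_map do |i|
    entry = index.fetch(i)
    io.seek(entry[:offset])
    Zlib::Inflate.inflate(io.read(entry[:size])).unpack("E*")
  end
  values[start - first * CHUNK_ELEMENTS, count]
end

values = Array.new(5_000) { |i| i.to_f }
io = StringIO.new("".b)
index = write_chunked(io, values)
puts read_range(io, index, 1020, 10) == values[1020, 10]
```

In a real file the index would also have to be stored, for example in the uncompressed header. Smaller chunks make random access cheaper but reduce the compression ratio.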

I think it is a good idea to implement partial read and write, and I will think about how to implement it.

What do you think, is it enough for a GSOC 2016 proposal?

Mar 16 '16 10:03 Aelphy

No, partial read and write are not supported. But I'm not sure a use-case supports this.

When you say "is it enough," do you mean for the code contribution requirement, or for a full summer of work?

Mar 16 '16 13:03 translunar

@mohawkjohn, I am talking about the code contribution.

Mar 16 '16 14:03 Aelphy

Yes. I think any non-trivial contribution — to show you understand the codebase and how open source development works — is sufficient.

With that said, we look very kindly on people who contribute more code. =)

Mar 18 '16 17:03 translunar
