Improvements for NMatrix::read and NMatrix#write (e.g., compression, yaml)

Open translunar opened this issue 11 years ago • 23 comments

Right now, NMatrix uses C++ STL's iostream to handle binary reading and writing of matrices. That's great and stuff, but I'm not sure how to make it compatible (without several days of research) with Ruby's IO module.

Ideally, we should be able to do things like this:

File.open("matrixfile", "wb") do |f|
  f.write(m)
end

or potentially more awkwardly:

File.open("matrixfile", "wb") do |f|
  m.write(f)
end

This would also enable us to run it through a Zlib filter and compress or decompress large matrices on the fly.
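As a rough illustration of the Zlib idea, here is a minimal sketch using Ruby's standard `zlib` library. It does not use NMatrix at all: the packed array of doubles is a hypothetical stand-in for whatever bytes a matrix would emit, and the file name is arbitrary.

```ruby
require "zlib"
require "tmpdir"

# Hypothetical stand-in for a matrix's raw contents: 1,000 doubles packed
# as little-endian binary. A real NMatrix would supply its own bytes.
values = Array.new(1000) { |i| i * 0.5 }
raw = values.pack("E*")

path = File.join(Dir.tmpdir, "matrixfile.gz")

# Compress on the fly: GzipWriter wraps any IO-like object.
File.open(path, "wb") do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write(raw)
  gz.finish # write the gzip trailer without closing f
end

# Decompress on the fly in the same way.
restored = File.open(path, "rb") do |f|
  gz = Zlib::GzipReader.new(f)
  data = gz.read
  gz.finish
  data.unpack("E*")
end
```

Because the compression layer only needs an object that behaves like an IO, the same code would work with sockets or `StringIO` buffers, which is the main appeal of going through Ruby's IO interface.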

Other thoughts:

  • We could use zlib.h to simply compress within the C++ code. That's the easiest solution, but is the least Ruby-like. It makes some sense because we're writing binary files here, and shouldn't need to stack multiple matrices in one file. On the other hand, it raises the question of what we do to pickle additional options, such as on types that inherit from NMatrix -- where does YAML information get stored?
  • Ideally, we only want to compress/decompress the data portion of the matrix. At the very least, the NMatrix version information needs to be in plain binary so NMatrix doesn't have to decompress the whole damn file to find out it's incompatible. That would suck with large matrices.
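The second point could be sketched roughly as follows. The layout here (magic string, three version bytes, rank, shape, payload length, then a deflated payload) is purely an assumption made for illustration and is not the actual NMatrix binary format. `MAGIC`, `FORMAT_VERSION`, `write_matrix`, `read_header`, and `read_matrix` are hypothetical names.

```ruby
require "zlib"
require "stringio"

MAGIC = "NMXS".b            # hypothetical magic bytes
FORMAT_VERSION = [1, 0, 0]  # hypothetical major, minor, patch

# Write an uncompressed header followed by a compressed data section.
def write_matrix(io, shape, values)
  payload = Zlib::Deflate.deflate(values.pack("E*"))
  io.write(MAGIC)
  io.write(FORMAT_VERSION.pack("C3"))
  io.write([shape.length].pack("C"))
  io.write(shape.pack("L<*"))
  io.write([payload.bytesize].pack("Q<"))
  io.write(payload)
end

# Read only the plain header; nothing is decompressed here.
def read_header(io)
  raise ArgumentError, "not a matrix file" unless io.read(4) == MAGIC
  version = io.read(3).unpack("C3")
  rank = io.read(1).unpack1("C")
  shape = io.read(4 * rank).unpack("L<*")
  { version: version, shape: shape }
end

# Check compatibility first, then inflate the data section.
def read_matrix(io)
  header = read_header(io)
  unless header[:version][0] == FORMAT_VERSION[0]
    raise ArgumentError, "incompatible file version #{header[:version].join('.')}"
  end
  size = io.read(8).unpack1("Q<")
  values = Zlib::Inflate.inflate(io.read(size)).unpack("E*")
  header.merge(values: values)
end

io = StringIO.new("".b)
write_matrix(io, [2, 2], [1.0, 2.0, 3.0, 4.0])
io.rewind
header = read_header(io) # cheap: touches only the first few bytes
io.rewind
matrix = read_matrix(io)
```

With this kind of layout, an incompatible file is rejected after reading a handful of bytes, regardless of how large the compressed data section is.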

More later, perhaps.

translunar avatar Jul 09 '13 23:07 translunar

Hi. I have thought about this problem. One option is to use Ruby's Marshal module.

Aelphy avatar Feb 17 '16 16:02 Aelphy

What do you think?

Aelphy avatar Feb 18 '16 11:02 Aelphy

Can you please provide some more detail?

translunar avatar Feb 18 '16 16:02 translunar

The idea is to convert an NMatrix object to a string using the Marshal module, which supports the load and dump operations. The dump operation returns a string that can be written to a file using the following interface:

File.open('matrixfile', 'wb') do |f|
  f.write(Marshal.dump(m))
end

http://ruby-doc.org/core-2.3.0/Marshal.html

It is also possible to add compression by defining a marshal_dump method for NMatrix.
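For illustration only, a minimal sketch of how `marshal_dump` and `marshal_load` could add compression. `ToyMatrix` is a hypothetical pure-Ruby stand-in; the real NMatrix keeps its data in C structures, so an actual implementation would need to obtain the bytes from the C side.

```ruby
require "zlib"
require "tmpdir"

# Hypothetical stand-in for NMatrix, holding a shape and a flat array of doubles.
class ToyMatrix
  attr_reader :shape, :values

  def initialize(shape, values)
    @shape = shape
    @values = values
  end

  # Called by Marshal.dump: return a plain, marshalable representation.
  def marshal_dump
    [@shape, Zlib::Deflate.deflate(@values.pack("E*"))]
  end

  # Called by Marshal.load: rebuild the object from that representation.
  def marshal_load(dumped)
    @shape, compressed = dumped
    @values = Zlib::Inflate.inflate(compressed).unpack("E*")
  end
end

m = ToyMatrix.new([2, 3], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
path = File.join(Dir.tmpdir, "toy_matrix.marshal")

File.open(path, "wb") { |f| f.write(Marshal.dump(m)) }
copy = File.open(path, "rb") { |f| Marshal.load(f) }
```

Note that `Marshal.load` should only be used on trusted files, since it can instantiate arbitrary objects.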

Aelphy avatar Feb 18 '16 17:02 Aelphy

Is it written in C, or in pure Ruby? How does it work with C data structures?

translunar avatar Feb 18 '16 21:02 translunar

I am thinking about the following implementation: Marshal will call the marshal_dump method of NMatrix. Up to this point it is pure Ruby. As far as I understand, it is a good idea to call C code inside marshal_dump that compresses the data and returns a string to the Ruby code. This string is then the return value of marshal_dump, and it is written to the file using the standard Ruby interface. This combines the advantages of C code with a Ruby interface.

Aelphy avatar Feb 18 '16 22:02 Aelphy

What do you mean by "compress," exactly?

translunar avatar Feb 19 '16 15:02 translunar

It could be anything, starting with passing the matrix data through Zlib. It is also possible to implement one of the methods for compressing large matrices mentioned here: https://peerj.com/preprints/849.pdf

Aelphy avatar Feb 19 '16 16:02 Aelphy

In other words, any data processing could be implemented at this step.

Aelphy avatar Feb 19 '16 16:02 Aelphy

Have you explored the numpy binary file format?

v0dro avatar Feb 19 '16 17:02 v0dro

Okay. I like this strategy. Go for it.

translunar avatar Feb 19 '16 18:02 translunar

@v0dro, not yet. Do you think it is a good idea to follow the same format? I understand that it could be useful for loading numpy matrices, but in general I think it is better to implement a separate method for reading numpy matrices.

Aelphy avatar Feb 19 '16 19:02 Aelphy

@mohawkjohn, ok.

Aelphy avatar Feb 19 '16 19:02 Aelphy

One more question: can this task be part of a GSoC 2016 proposal?

Aelphy avatar Mar 16 '16 04:03 Aelphy

Also, I have just checked that defining a marshal_dump method directly in the NMatrix class makes Marshal.dump work correctly. Now I am thinking about where to define this method, because it will need access to the matrix data.

Aelphy avatar Mar 16 '16 05:03 Aelphy

I have a doubt. If you're going to implement a compression algorithm for storing matrices, how can one seek to a particular element directly and then read a given number of elements from that point onward?

That functionality would be important since very large matrices cannot be stored in memory and often users will want to read off a part of it from persistent storage.
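To make the concern concrete, here is a minimal sketch of seeking into an uncompressed dense file and reading a few elements. It assumes a hypothetical layout of raw little-endian doubles with an optional fixed-size header, which is not the actual NMatrix format, and `read_elements` is an illustrative helper name.

```ruby
require "tmpdir"

BYTES_PER_DOUBLE = 8

# Read `count` consecutive elements starting at flat index `start`.
# Only the requested bytes are read from disk.
def read_elements(path, start, count, header_bytes: 0)
  File.open(path, "rb") do |f|
    f.seek(header_bytes + start * BYTES_PER_DOUBLE)
    f.read(count * BYTES_PER_DOUBLE).unpack("E*")
  end
end

path = File.join(Dir.tmpdir, "dense_matrix.bin")
values = Array.new(10_000) { |i| i.to_f }
File.binwrite(path, values.pack("E*"))

slice = read_elements(path, 5_000, 4)
# slice is [5000.0, 5001.0, 5002.0, 5003.0]
```

The byte offset of an element is a simple multiplication here. Once the whole data section is passed through a single compression stream, that arithmetic no longer holds, which is the problem being raised.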

v0dro avatar Mar 16 '16 07:03 v0dro

I agree, that is a problem. For now, I want to implement the interface first (without compression).

Aelphy avatar Mar 16 '16 09:03 Aelphy

Does NMatrix support partial reads and writes from disk (the problem you described)? Or are you talking about working with a compressed matrix on disk without NMatrix?

Aelphy avatar Mar 16 '16 09:03 Aelphy

I'm concerned about whether your compressed matrix will support file seeking. Hence I don't think storing in a compressed file format is a very good idea in the first place.

I don't think partial read/write is supported. You can add that as part of this issue.

v0dro avatar Mar 16 '16 10:03 v0dro

I have found an article that describes seekable compression using zlib: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/ I think it could be interesting to try this approach.
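One simple way to get random access into compressed data is to compress fixed-size blocks independently and keep an index of where each block starts. The sketch below shows that general idea only; it is not necessarily the scheme the linked article describes. The block size and the helper names `write_blocks` and `read_range` are assumptions.

```ruby
require "zlib"
require "stringio"

BLOCK_ELEMENTS = 1024 # hypothetical number of doubles per block

# Compress the values block by block. Returns an index with one
# [byte_offset, byte_length] pair per block.
def write_blocks(io, values)
  values.each_slice(BLOCK_ELEMENTS).map do |block|
    compressed = Zlib::Deflate.deflate(block.pack("E*"))
    entry = [io.pos, compressed.bytesize]
    io.write(compressed)
    entry
  end
end

# Read `count` elements starting at flat index `start`, inflating only
# the blocks that overlap the requested range.
def read_range(io, index, start, count)
  first_block = start / BLOCK_ELEMENTS
  last_block = (start + count - 1) / BLOCK_ELEMENTS
  decoded = (first_block..last_block).flat_map do |b|
    offset, length = index[b]
    io.seek(offset)
    Zlib::Inflate.inflate(io.read(length)).unpack("E*")
  end
  decoded[start - first_block * BLOCK_ELEMENTS, count]
end

io = StringIO.new("".b)
values = Array.new(5000) { |i| i.to_f }
index = write_blocks(io, values)

slice = read_range(io, index, 1020, 8) # spans the first two blocks
```

Smaller blocks make partial reads cheaper but usually compress worse, so the block size is a trade-off. In a real file format the index would also have to be stored, for example in the uncompressed header.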

I think implementing partial read and write is a good idea, and I will think about how to do it.

Do you think that is enough for a GSoC 2016 proposal?

Aelphy avatar Mar 16 '16 10:03 Aelphy

No, partial read and write are not supported. But I'm not sure there is a use case for them.

When you say "is it enough," do you mean for the code contribution requirement, or for a full summer of work?

translunar avatar Mar 16 '16 13:03 translunar

@mohawkjohn, I am talking about the code contribution.

Aelphy avatar Mar 16 '16 14:03 Aelphy

Yes. I think any non-trivial contribution — to show you understand the codebase and how open source development works — is sufficient.

With that said, we look very kindly on people who contribute more code. =)

translunar avatar Mar 18 '16 17:03 translunar