nmatrix
Improvements for NMatrix::read and NMatrix#write (e.g., compression, yaml)
Right now, NMatrix uses C++ STL's iostream to handle binary reading and writing of matrices. That's great and stuff, but I'm not sure how to make it compatible (without several days of research) with Ruby's IO module.
Ideally, we should be able to do things like this:
File.open("matrixfile", "wb") do |f|
  f.write(m)
end
or potentially more awkwardly:
File.open("matrixfile", "wb") do |f|
  m.write(f)
end
This would also enable us to run it through a Zlib filter and compress or decompress large matrices on the fly.
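As a rough sketch of what "running it through a Zlib filter" could look like once matrix bytes can flow through a Ruby IO object: the example below uses a packed array of doubles as a stand-in for the matrix payload (an assumption, since NMatrix does not currently write to Ruby IO objects) and a StringIO in place of a real file.

```ruby
require "zlib"
require "stringio"

# Stand-in for a matrix's binary payload: 1,000 doubles packed as
# little-endian float64. A real NMatrix would supply its own bytes.
values = Array.new(1_000) { |i| (i % 10).to_f }
raw = values.pack("E*")

# Compress on the fly. GzipWriter wraps any IO-like object, so a File
# opened with "wb" could be used here instead of a StringIO.
buffer = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(buffer)
gz.write(raw)
gz.finish # write the gzip trailer without closing the underlying IO

compressed = buffer.string

# Decompress on the fly by wrapping the reading side in GzipReader.
restored = Zlib::GzipReader.new(StringIO.new(compressed)).read.unpack("E*")
```

The point of the design is that the matrix code only needs to talk to something IO-like; whether that object compresses, buffers, or writes straight to disk is decided by the caller.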
Other thoughts:
- We could use zlib.h to simply compress within the C++ code. That's the easiest solution, but is the least Ruby-like. It makes some sense because we're writing binary files here, and shouldn't need to stack multiple matrices in one file. On the other hand, it raises the question of what we do to pickle additional options, such as on types that inherit from NMatrix -- where does YAML information get stored?
- Ideally, we only want to compress/decompress the data portion of the matrix. At the very least, the NMatrix version information needs to be in plain binary so NMatrix doesn't have to decompress the whole damn file to find out it's incompatible. That would suck with large matrices.
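To illustrate the second point, here is a minimal sketch of a file with an uncompressed header and a compressed data portion. The layout, the magic string "NMX0", and the helper names are all hypothetical; this is not NMatrix's actual on-disk format.

```ruby
require "zlib"
require "stringio"

# Hypothetical layout (NOT the real NMatrix format):
#   4 bytes  magic "NMX0"
#   3 x u16  version major/minor/patch, little-endian
#   u32      byte length of the compressed payload, little-endian
#   payload  zlib-deflated matrix data
MAGIC = "NMX0".b
HEADER_FORMAT = "a4v3V"
HEADER_SIZE = 4 + 3 * 2 + 4

def write_matrix(io, version, data_bytes)
  payload = Zlib::Deflate.deflate(data_bytes)
  io.write([MAGIC, *version, payload.bytesize].pack(HEADER_FORMAT))
  io.write(payload)
end

# Reads only the fixed-size header; the payload is never touched.
def read_header(io)
  magic, major, minor, patch, length = io.read(HEADER_SIZE).unpack(HEADER_FORMAT)
  raise ArgumentError, "not a matrix file" unless magic == MAGIC
  { version: [major, minor, patch], payload_length: length }
end

# Reads the header, then inflates the data portion that follows it.
def read_data(io)
  header = read_header(io)
  Zlib::Inflate.inflate(io.read(header[:payload_length]))
end

data = Array.new(500) { |i| (i % 7).to_f }.pack("E*")
io = StringIO.new(String.new)
write_matrix(io, [0, 2, 1], data)

io.rewind
version = read_header(io)[:version] # cheap compatibility check

io.rewind
round_trip = read_data(io)
```

With this kind of layout, an incompatible file can be rejected after reading 14 bytes, regardless of how large the matrix is.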
More later, perhaps.
Hi. I have thought about this problem. It is possible to use Marshal for this.
What do you think?
Can you please provide some more detail?
The idea is to convert an NMatrix object to a string using the Marshal module, which supports load and dump operations. Marshal.dump returns a string, which can be written to a file using the following interface:
File.open('matrixfile', 'wb') do |f|
  f.write(Marshal.dump(m))
end
http://ruby-doc.org/core-2.3.0/Marshal.html
It is also possible to add compression by defining a marshal_dump method for NMatrix.
Is it written in C, or in pure Ruby? How does it work with C data structures?
I am thinking of the following implementation: Marshal will call the marshal_dump method of NMatrix. Up to this point it is pure Ruby. As far as I understand, inside marshal_dump it is a good idea to call C code that compresses the data and returns a string to the Ruby code. This string will then be the output of the marshal_dump routine, and it will be written to the file using the standard Ruby interface. This combines the advantages of C code with a Ruby interface.
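A minimal sketch of that flow, assuming a hypothetical ToyMatrix class in place of NMatrix and Ruby's Zlib bindings in place of custom C compression code:

```ruby
require "zlib"

# ToyMatrix is a hypothetical stand-in for NMatrix, used only to show
# how the marshal_dump / marshal_load hooks fit together.
class ToyMatrix
  attr_reader :shape, :data

  def initialize(shape, data)
    @shape = shape
    @data = data
  end

  # Called by Marshal.dump. Whatever this returns is serialized in place
  # of the object's instance variables.
  def marshal_dump
    [@shape, Zlib::Deflate.deflate(@data.pack("E*"))]
  end

  # Called by Marshal.load with the object that marshal_dump returned.
  def marshal_load(dumped)
    @shape, compressed = dumped
    @data = Zlib::Inflate.inflate(compressed).unpack("E*")
  end
end

m = ToyMatrix.new([2, 3], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
bytes = Marshal.dump(m)   # a String, ready for f.write
copy = Marshal.load(bytes)
```

Note that marshal_dump does not have to return a string; any marshalable object works, which leaves room to keep metadata such as the shape outside the compressed part.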
What do you mean by "compress," exactly?
It could be anything, starting with passing the matrix data through Zlib. It is also possible to implement one of the large-matrix compression methods described here: https://peerj.com/preprints/849.pdf
In other words, any data processing could be implemented at this step.
Have you explored the numpy binary file format?
Okay. I like this strategy. Go for it.
@v0dro, not yet. Do you think it is a good idea to follow the same format? I understand that it could be useful for loading numpy matrices, but in general I think it is better to implement a separate method for reading numpy matrices.
@mohawkjohn, ok.
One more question: can this task be part of a GSoC 2016 proposal?
Also, I have just checked that defining a marshal_dump method directly in the NMatrix class makes Marshal.dump work correctly. Now I am thinking about where to define this method, because I will need access to the matrix data inside it.
I have a doubt. If you're going to implement a compression algorithm for storing matrices, how can one seek to a particular element directly and then read a given number of elements from that point onward?
That functionality would be important since very large matrices cannot be stored in memory and often users will want to read off a part of it from persistent storage.
I agree, that is a problem. For the moment I want to implement the interface first (without compression).
Does NMatrix support partial reads and writes from disk (the problem you have described)? Or are you talking about working with a compressed matrix written to disk without NMatrix?
I'm concerned whether your compressed matrix will work for file seeking. Hence I don't think storing in a compressed file format is a very good idea in the first place.
I do not think partial read/write is supported. You can add that as part of this issue.
I have found an article that describes seekable compression using zlib: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/. I think it could be interesting to try this approach.
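One simple way to get seekable compression is to compress fixed-size blocks independently and keep an index of where each block starts. The sketch below shows that idea only; it is a simplified block scheme, not necessarily any of the specific formats the article covers, and the block size and helper names are arbitrary assumptions.

```ruby
require "zlib"
require "stringio"

BLOCK_ELEMENTS = 256 # elements per independently compressed block (arbitrary)

# Compress `values` block by block. Returns [io, index], where index[i]
# is [byte offset, compressed length] of block i inside io.
def write_blocks(values)
  io = StringIO.new(String.new)
  index = values.each_slice(BLOCK_ELEMENTS).map do |slice|
    compressed = Zlib::Deflate.deflate(slice.pack("E*"))
    entry = [io.pos, compressed.bytesize]
    io.write(compressed)
    entry
  end
  [io, index]
end

# Read `count` elements starting at element `start`, inflating only the
# blocks that overlap the requested range.
def read_range(io, index, start, count)
  return [] if count <= 0
  first_block = start / BLOCK_ELEMENTS
  last_block = (start + count - 1) / BLOCK_ELEMENTS
  elements = (first_block..last_block).flat_map do |b|
    offset, length = index.fetch(b)
    io.seek(offset)
    Zlib::Inflate.inflate(io.read(length)).unpack("E*")
  end
  elements[start - first_block * BLOCK_ELEMENTS, count]
end

values = Array.new(1_000) { |i| i.to_f }
io, index = write_blocks(values)
slice = read_range(io, index, 500, 10)      # touches one block
straddling = read_range(io, index, 250, 10) # spans two blocks
```

The trade-off is the usual one: smaller blocks mean less wasted decompression per read but a worse compression ratio and a larger index.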
I think it is a good idea to implement partial read and write, and I will think about how to implement it.
Do you think that is enough for a GSoC 2016 proposal?
No, partial read and write are not supported. But I'm not sure a use-case supports this.
When you say "is it enough," do you mean for the code contribution requirement, or for a full summer of work?
@mohawkjohn, I am talking about the code contribution.
Yes. I think any non-trivial contribution — to show you understand the codebase and how open source development works — is sufficient.
With that said, we look very kindly on people who contribute more code. =)