Reading files into NArray

Open railsmk opened this issue 2 years ago • 10 comments

Hello,

I have asked a question on Stack Overflow about reading files into Numo::NArray and dynamically inserting rows/data. Could you show me the way, or tell me whether what I'm trying to do is even possible with this library?

https://stackoverflow.com/questions/68282417/reading-files-into-ruby-numonarray

railsmk avatar Jul 07 '21 08:07 railsmk

Hello. I think it depends on the size of the data.

If the data is small, you can read a text file and convert the Ruby array to a Numo::NArray with the cast method. For example,

Input file:

3 5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Ruby script:

require 'numo/narray'

na = nil

File.open("input.txt", "r") do |f|
  a, b = f.gets.split(" ").map(&:to_i)
  arr = f.gets.split(" ").map(&:to_i)
  na = Numo::UInt8.cast(arr).reshape(a, b)
end

p na

https://github.com/ruby-numo/numo-narray/issues/132

Of course, you can also use the CSV library.

If the input is in binary format, use from_string to convert a binary string to NArray. https://github.com/ruby-numo/numo-narray/issues/140

Numo::NArray is not flexible enough to change its size. Methods such as hstack, vstack, and dstack look as if they can change the shape. However, they are actually generating a new NArray. When inserting and increasing the number of rows, NArray needs to allocate new memory. You would need to create a new NArray each time.

If you want to deal with really huge data, Apache Arrow may be useful.

Currently, Apache Arrow is the fastest way to convert CSV files to Ruby data, with much better performance than the standard CSV library.

See https://github.com/red-data-tools/red-arrow-numo-narray

kojix2 avatar Jul 07 '21 09:07 kojix2

Or do you need an alternative to numpy's fromfile? https://numpy.org/doc/stable/reference/generated/numpy.fromfile.html

kojix2 avatar Jul 07 '21 09:07 kojix2

Thank you for your reply.

Yes, I'm searching for a way to load huge data in a Rails/Ruby application. Small data would also be used, as it all depends on user input. The user can upload any file type; it gets split and the data is loaded as binary. The data must be loaded into a matrix (every file in a different row), and then I need to perform a few operations (multiplications and inversions) with smaller helper matrices. The crucial thing is that I have to modify the library so it works with Galois fields. I have done this with the old NArray. The performance of the math operations was OK, I would say, but the time it took to load the data was unacceptable.

I have never used numpy, but I think there would be a problem, as I have to combine a few files into one array/matrix. I will try the red-arrow lib and see if it works.

Thank you once again. I would appreciate it if you could think of anything else to recommend now that I have specified my goal.

railsmk avatar Jul 07 '21 10:07 railsmk

I can also specify that the files a user can upload won't be larger than 8-16 GB. Will the Ruby script you posted at the beginning be enough to achieve fast results? I guess that's not considered "large" data.

I tried loading the data with the built-in Ruby array in my first approach to the algorithm and it wasn't enough, so I guess I will have to use something else.

railsmk avatar Jul 07 '21 10:07 railsmk

I see. Big files, over several gigabytes... If so, it would be better to create a binary string somehow and use the store_binary or from_binary method to create the NArray. If your input file is TSV or CSV, Apache Arrow + red-arrow-numo-narray is a good choice. However, even for other files, you may be able to speed up loading if you find a way to create a binary string from your files. You don't need to use a Ruby script to create the binary string. It will be faster if you use a fast executable. That's about all I know about it. I guess others can answer in more detail.

kojix2 avatar Jul 07 '21 11:07 kojix2

Thank you for sharing your knowledge. It's the first time I'm working with data this big, so you helped a lot, as I didn't know anything about it. I'm going to try your recommendations in the following days.

railsmk avatar Jul 07 '21 11:07 railsmk

Good luck with your work. If you are familiar with the C language, you may want to create C extensions to read the files. I don't know much about C, but I think a library called magro might be helpful. https://github.com/yoshoku/magro

https://github.com/yoshoku/magro/blob/2ed598d02f0d9cc52baead28415dbdb8c6883101/ext/magro/imgrw.c#L116-L122

  VALUE nary;
  uint8_t* nary_ptr;
  nary = rb_narray_new(numo_cUInt8, n_dims, shape);
  nary_ptr = (uint8_t*)na_get_pointer_for_write(nary);


  for (y = 0; y < height; y++) {
    row_ptr = row_ptr_ptr[y];
    memcpy(nary_ptr + y * width * n_ch, row_ptr, width * n_ch);
  }
  return nary;

kojix2 avatar Jul 08 '21 01:07 kojix2

https://github.com/Himeyama/narray-fromfile

kojix2 avatar Jul 22 '21 13:07 kojix2

Thank you for constantly bringing new ideas to the table. Overall the library works fine. Performance is OK for now; in particular, loading directly from binary data works great, better than I expected. I will try the fromfile gem soon. I am not yet sure whether matrix multiplication performance is going to be sufficient to bring the product to market, but this part can be done in a different way. I appreciate your involvement, take care.

railsmk avatar Jul 26 '21 06:07 railsmk

@railsmk That's good to know. narray-fromfile is a library created by a university student as a hobby. The implementation is helpful, but you might not want to use it in a production environment. Good luck!

kojix2 avatar Jul 27 '21 06:07 kojix2