
Issue with marshal dump

Open seoanezonjic opened this issue 4 years ago • 5 comments

Hi, authors of Numo-narray.

I'm working with biological networks, so I need to use matrix operations frequently. In my project I was using the old NMatrix library, but recently I found your project. Your NumPy style and your support for the community convinced me to replace NMatrix with Numo-narray in my project. I have observed a great improvement in memory, speed and code clarity using your library, but I have problems serializing large matrices. With a matrix of 17078x17078 elements I get the following error:

in `dump': long too big to dump (TypeError)

Here is some context for the error. In my pipeline, the first step is to take a network defined by pairs, transform it into a Numo::NArray object and then write it to disk with:

File.binwrite(options[:output_matrix_file], Marshal.dump(matrix))

This way I can run several different executions without wasting time on the plain-text-to-matrix transformation: I read my serialized matrix and perform the operation I'm interested in. The plain-text-to-matrix transformation is done with the following code:

def add_pair(node_a, node_b, weight, connections)
	query = connections[node_a]
	if !query.nil?
		query[node_b] = weight
	else
		subhash = Hash.new(0.0)
		subhash[node_b] = weight
		connections[node_a] = subhash
	end
end

connections = {}
source.each do |line|
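	node_a, node_b, weight = line.chomp.split("\t")
	# Pairs without an explicit weight default to 1.0
	weight = weight.nil? ? 1.0 : weight.to_f
	# Store the pair in both directions so the matrix is symmetric
	add_pair(node_a, node_b, weight, connections)
	add_pair(node_b, node_a, weight, connections)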
end
names = connections.keys
matrix = Numo::DFloat.zeros(names.length, names.length)
connections.each do |node_a, subhash|
	index_a = names.index(node_a)
	subhash.each do |node_b, weight|
		index_b = names.index(node_b)
		matrix[index_a, index_b] = weight
	end
end

Here, source is an IO object for the plain text file, and my strategy is to build a hash of the pair relations and use it to fill the matrix. The data needed to create this matrix is available at the following link: https://www.dropbox.com/s/vqmzkazag3m3fgz/pairs.tar.gz?dl=0

This was run with numo-narray 0.9.1.5 on openSUSE 12.3 (x86_64; VERSION = 12, PATCHLEVEL = 3, CODENAME = Malachite).

Thank you in advance, Pedro Seoane

seoanezonjic, Nov 12 '19 11:11

This error comes from the implementation of Ruby itself. Ruby fails to marshal a string of 2 GiB or more.

$ ruby -e 'Marshal.dump(" "*2**31)'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `dump': long too big to dump (TypeError)

NArray is subject to this limitation because it handles its binary data as a Ruby string. Hmm...
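For the matrix in this report, the arithmetic can be checked directly. This is a minimal sketch, assuming that the marshaled payload is dominated by the raw element buffer (8 bytes per Numo::DFloat element) and that the limit is a signed 32-bit length, as the reproduction above suggests. The variable names are illustrative only.

```ruby
# Size of the raw buffer for an n x n matrix of 8-byte floats (Numo::DFloat).
n = 17_078
bytes_per_element = 8 # one double-precision float
buffer_bytes = n * n * bytes_per_element

# Assumption: Marshal writes lengths as signed 32-bit integers,
# so the largest dumpable string is 2**31 - 1 bytes.
marshal_limit = 2**31 - 1

puts "buffer: #{buffer_bytes} bytes, limit: #{marshal_limit} bytes"
puts "fits in Marshal: #{buffer_bytes <= marshal_limit}"

# Largest square DFloat matrix whose buffer still fits under that limit.
max_n = Integer.sqrt(marshal_limit / bytes_per_element)
puts "largest square dimension: #{max_n}"
```

Under these assumptions the 17078x17078 buffer is 2,333,264,672 bytes, which is over the limit, and the largest square DFloat matrix that would still fit is 16383x16383.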

masa16, Nov 13 '19 08:11

I was digging into the Marshal class code, and the problem is described from this line to line 321. The check is hardcoded against a SIZEOF_LONG limit of 4, while this constant is 8 in my current Ruby installation. I don't know why the Ruby authors do this check.

When I realized that the serialization problem is in fact a Ruby core problem, I started to search for another serialization method. I found the npy gem, which works with Numo-narray. This gem saves the matrix in the NumPy format, so it is possible to pass data from Ruby to Python and vice versa and, most importantly, it doesn't use Ruby's Marshal class. I have made several attempts and all of them worked perfectly.

I don't know whether you want to close this issue or improve NArray serialization with this gem in order to handle very large matrices (I think it would be a great feature).

Thank you very much for your attention. Pedro Seoane
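To illustrate why a plain binary file sidesteps the Marshal limit, here is a stdlib-only sketch. It is neither the NumPy file format nor the npy gem's API: write_matrix and read_matrix are hypothetical helpers, and the file layout (a 16-byte header with the row and column counts, followed by little-endian doubles) is an assumption made for this example. The matrix is a plain array of row arrays so that the sketch runs without any gems.

```ruby
# Hypothetical helper: write the matrix as a 16-byte header (row and column
# counts as 64-bit little-endian integers) followed by raw 8-byte doubles.
def write_matrix(path, rows)
  n_cols = (rows.first || []).length
  File.open(path, "wb") do |f|
    f.write([rows.length, n_cols].pack("Q<2"))
    # Write one row at a time, so no single string approaches 2 GiB.
    rows.each { |row| f.write(row.pack("E*")) }
  end
end

# Hypothetical helper: read back a file produced by write_matrix.
def read_matrix(path)
  File.open(path, "rb") do |f|
    n_rows, n_cols = f.read(16).unpack("Q<2")
    Array.new(n_rows) { n_cols.zero? ? [] : f.read(n_cols * 8).unpack("E*") }
  end
end
```

Because no length ever passes through Marshal, the 32-bit check is never reached. With Numo itself, a similar round trip should be possible with to_binary and from_binary plus the shape, and the npy gem packages the same idea in a format that NumPy can read.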

seoanezonjic, Nov 13 '19 09:11

Hi @seoanezonjic, your report is very interesting. How about reporting it to the Ruby core team? https://bugs.ruby-lang.org/projects/ruby/wiki/HowToReport

kojix2, Nov 14 '19 23:11

Hi @kojix2, I have reported it to the Ruby core team, but it seems to be an old problem. There was a related issue with the serialization of the Time class, which was solved by patching the Time class code rather than the Marshal code. So I'll wait for the Ruby core team's answer and continue using Npy to serialize the matrix. Thank you for your attention.

seoanezonjic, Nov 18 '19 11:11

Here is how Python handles this. It has a new pickle protocol version that uses the buffer protocol (memoryview): https://peps.python.org/pep-0574/

I tried to explain it in this proposal on the Ruby bug tracker: https://bugs.ruby-lang.org/issues/17685

Python also has a new shared memory module that uses memory views: https://docs.python.org/dev/library/multiprocessing.shared_memory.html

dsisnero, Aug 25 '22 15:08