ZombieWriter
"comparison of Float with NaN failed"...and GSL is Installed
While trying to fix an unrelated issue, I experimented with the code from #5, but using ZombieWriter::MachineLearning rather than ZombieWriter::Randomization.
zombie = ZombieWriter::MachineLearning.new
zombie.add_string(content: "This is filler text that I invented.This is also a paragraph that could be used")
zombie.add_string(content: "This post is amazing. Please take a look")
zombie.add_string(content: "For all sports fan, you must watch this video. Hey you have to check this out.")
array = zombie.generate_articles
p array
#/Users/tariqali/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/kmeans-clusterer-0.11.4/lib/kmeans-clusterer.rb:237:in `sort_by': comparison of Float with NaN failed (ArgumentError)
The culprit is the third string. Classifier-Reborn computed its lsi_norm as a vector of NaNs:
"For all sports fan, you must watch this video. Hey you have to check this out.\n"=>
#<ClassifierReborn::ContentNode:0x007fdec4b25ae8
@categories=[],
@lsi_norm=GSL::Vector
[ nan nan nan nan nan nan nan ... ],
@lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:for=>1, :sport=>1, :fan=>1, :must=>1, :watch=>1, :video=>1, :hei=>1, :check=>1, :out=>1}>}
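The NaN entries line up with the error message: a NaN cannot be ordered against any other Float, so a sort that meets one raises ArgumentError. Here is a minimal plain-Ruby sketch of that behavior. It does not call kmeans-clusterer or GSL, and the sortable? helper and the sample values are hypothetical, made up for illustration only.

```ruby
# Plain-Ruby illustration; no GSL or kmeans-clusterer required.
# sortable? is a hypothetical helper, not part of any gem.
def sortable?(values)
  values.sort_by { |v| v }
  true
rescue ArgumentError
  # Raised when two keys cannot be compared, e.g. a Float and NaN.
  false
end

distances = [0.25, Float::NAN, 0.75] # hypothetical values

p 1.0 <=> Float::NAN       # nil, because NaN has no ordering
p sortable?([0.25, 0.75])  # true
p sortable?(distances)     # false
```

If the values that kmeans-clusterer sorts are derived from lsi_norm, a single all-NaN vector would be enough to trigger the failure, though that link is inferred from the stack trace rather than verified.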
Changing the third string slightly resolves the issue.
zombie = ZombieWriter::MachineLearning.new
zombie.add_string(content: "This is filler text that I invented.This is also a paragraph that could be used")
zombie.add_string(content: "This post is amazing. Please take a look")
zombie.add_string(content: "For all sports fan, you must watch this video. Hey you have to check this out. Filler, filler, filler.")
array = zombie.generate_articles
p array
"For all sports fan, you must watch this video. Hey you have to check this out. Filler, filler, filler.\n"=>
#<ClassifierReborn::ContentNode:0x007fd931432fd0
@categories=[],
@lsi_norm=GSL::Vector
[ 6.205e-01 1.432e-01 1.432e-01 1.432e-01 1.432e-01 1.432e-01 0.000e+00 ... ],
@lsi_vector=GSL::Vector
[ 6.593e-01 1.522e-01 1.522e-01 1.522e-01 1.522e-01 1.522e-01 0.000e+00 ... ],
@raw_norm=GSL::Vector
[ 5.547e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_vector=GSL::Vector
[ 6.272e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:for=>1, :sport=>1, :fan=>1, :must=>1, :watch=>1, :video=>1, :hei=>1, :check=>1, :out=>1, :filler=>3}>}
But why? Both scenarios appeared to have a @word_hash, so it isn't quite clear why one string had a vector of NaNs and the other didn't. Is it because in the second scenario, the third string had words that were similar to those of the first string? I will have to research this issue more carefully and decide how to gracefully handle this potential error.
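One reading that fits both dumps, offered as a hypothesis and not a confirmed diagnosis: in the failing case @lsi_vector is all zeros (as far as the truncated output shows), and scaling a zero vector to unit length divides 0.0 by 0.0. The sketch below uses plain Ruby arrays instead of GSL::Vector, and unit_vector is a hypothetical stand-in. Whether classifier-reborn normalizes in exactly this way is an assumption.

```ruby
# Hypothetical helper: scale a vector to unit length, the way a
# normalization step might. Plain arrays stand in for GSL::Vector.
def unit_vector(vector)
  magnitude = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / magnitude }
end

p unit_vector([3.0, 4.0])       # [0.6, 0.8]
p unit_vector([0.0, 0.0, 0.0])  # every entry is NaN, since 0.0 / 0.0 is NaN
```

Under this reading, adding "Filler" to the third string helps because it shares a word with the first string, which gives the third string a non-zero vector to normalize.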
This problem is unlikely to happen in the real world...if you add long passages to ZombieWriter, there are bound to be a few overlapping words that classifier-reborn can detect. But it could happen...which is why I need to figure out how to fix it.
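One way to handle this gracefully might be to set aside any paragraph whose vector contains NaN before the vectors reach the clustering step. The following is only a sketch under that assumption: partition_finite is a hypothetical helper, the vectors are plain Ruby arrays instead of GSL::Vector, and where such a filter would sit inside ZombieWriter is not settled here.

```ruby
# Hypothetical guard: split a { text => vector } hash into entries whose
# vectors are fully finite and entries containing NaN or Infinity.
def partition_finite(vectors_by_text)
  vectors_by_text
    .partition { |_text, vector| vector.all?(&:finite?) }
    .map(&:to_h)
end

# Made-up sample data for illustration.
vectors = {
  "first paragraph"  => [0.62, 0.14],
  "second paragraph" => [Float::NAN, Float::NAN]
}

usable, skipped = partition_finite(vectors)
p usable.keys   # ["first paragraph"]
p skipped.keys  # ["second paragraph"]
```

Skipped paragraphs could then be reported to the caller or placed in their own cluster, so that they are not silently dropped.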
Same problem here. Hope to see an answer.
Hi @mahaina. I'll see if I can work on this issue, probably in the next two weeks. If you have a sample corpus where this error can occur reliably, please send that over to me so that I can use it as 'test' material (though it's not necessary and I can work with the existing corpus within the OP). Right now though, I'm using those three sentences I mentioned in the OP, which allows me to reliably reproduce the error, but it's possible that your corpus might have some unique characteristics as well.