decisiontree
decisiontree copied to clipboard
Infinite recursion due to the handling of zero gains in id3_continuous
On a real production dataset with 5 explanatory variables and ~1000 lines, I received a SystemStackError: stack level too deep when calling DecisionTree::ID3Tree#train.
Trying to figure out what was happening, I built the following simple dataset, which allows to reveal the bug:
attributes = ["X0", "X1", "X2", "X3"]
data = [["a", 0, 1, 1, 1], ["a", 0, 1, 0, 0], ["a", 0, 0, 1, 0], ["a", 0, 0, 0, 1]]
data_type = {X0: :discrete, X1: :continuous, X2: :continuous, X3: :continuous}
tree = DecisionTree::ID3Tree.new(attributes, data, 1, data_type)
tree.train # SystemStackError is raised here!
The reason of this bug seems to lie in the specific output ([-1, -1]) of DecisionTree::ID3Tree#id3_continuous in the case if values.size == 1 (see this line).
Returning [0, -1] instead of [-1, -1] in the cases if values.size == 1 and if gain.size == 1 in the method #id3_continuous solves the problem.
It would also be relevant to stop the recursion in the case where the selection of each variable leads to a zero gain. That can be done adding in #id3_train the following line:
return data.first.last if performance.all? { |a, b| a <= 0 }
after this line:
performance = attributes.collect { |attribute| fitness_for(attribute).call(data, attributes, attribute) }
What do you think?
Do you want me to make a pull request with these changes?