
Feature Suggestion

Open mutigozel opened this issue 8 years ago • 6 comments

First of all, I want to congratulate you on this project. I have a suggestion and couldn't figure out where to write it other than in the issues on GitHub.

My suggestion is to expose the number of observations (or better, their indices, since one can count them from those) that fall into the left and right child of a node.

mutigozel avatar Apr 11 '17 15:04 mutigozel

It is an interesting feature. However, I am a bit reluctant to add more data to the decision tree nodes, especially something like a list, which can potentially become very large. A RandomForest, for instance, will typically consist of many hundred thousand nodes. If each node has a large list of indices, the memory footprint of the model will increase significantly.
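For a sense of scale, here is a rough back-of-envelope sketch (not SharpLearning code; the observation count, tree depth, and tree count below are purely illustrative assumptions). If every node stored the indices of the rows that reached it, each observation would be stored roughly once per tree level, so the cost grows with observations × depth × trees:

```csharp
using System;

// All numbers are illustrative assumptions, not SharpLearning internals.
const long observations = 100_000;   // rows in the training set (assumption)
const int treeDepth = 20;            // levels per fully grown tree (assumption)
const int trees = 100;               // trees in the forest (assumption)
const int bytesPerIndex = sizeof(int);

long storedIndices = observations * treeDepth * trees;
double megabytes = storedIndices * (double)bytesPerIndex / (1024 * 1024);
Console.WriteLine($"~{storedIndices:N0} stored indices, roughly {megabytes:N0} MB");
// ~200,000,000 stored indices, roughly 763 MB
```

Even a modest training set can push the stored-index count into the hundreds of millions once a full forest is involved.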

To help me better understand the feature suggestion, maybe you can add some more information about how you want to use this feature? A short use-case or similar?

mdabros avatar Apr 12 '17 20:04 mdabros

I understand that it will increase the memory footprint. Adding an option like maxTreeDepth could help keep that in check.

Adding the indices of the observations that fall to the left or right to each node would make more statistical information about the node available: how many observations are on the right, the total of the feature on the right, the average of the feature on the left, the standard deviation of the right observations, the median of the left ones, and so on. In particular, one could also generate statistics on variables other than the feature variable used by the node.

My example case would be something like this: let's say I am doing decision tree analysis on the volume generated by each division, where each division consists of a different number of people. I would like to see the per-person volume for the left observations and the right observations.

Division - Volume - People
Div01 - 100K - 2
Div03 - 50K - 1
Div04 - 1M - 10
Div90 - 740K - 100
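To make the per-person volume concrete, here is a minimal sketch (hypothetical: the leftIndices/rightIndices arrays and the PerPersonVolume helper stand in for the row indices a node would expose if the feature existed; they are not part of SharpLearning today):

```csharp
using System;
using System.Linq;

// Example data from the table above.
var volume = new double[] { 100_000, 50_000, 1_000_000, 740_000 }; // Div01, Div03, Div04, Div90
var people = new double[] { 2, 1, 10, 100 };

// Assumed split produced during learning: which training rows went left / right.
int[] leftIndices = { 0, 1 };   // Div01, Div03
int[] rightIndices = { 2, 3 };  // Div04, Div90

// Aggregate a statistic over the rows that reached one side of the split.
double PerPersonVolume(int[] rows) => rows.Sum(i => volume[i]) / rows.Sum(i => people[i]);

Console.WriteLine($"Left per-person volume:  {PerPersonVolume(leftIndices):N0}");  // 50,000
Console.WriteLine($"Right per-person volume: {PerPersonVolume(rightIndices):N0}"); // ~15,818
```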

mutigozel avatar Apr 12 '17 22:04 mutigozel

If I understand correctly, what you are seeking seems to be a tool for decision tree analysis, as described here: https://en.wikipedia.org/wiki/Decision_tree, where the main purpose of the decision tree is to gain information from the structure of the tree and/or to help identify a strategy for reaching a certain goal. Is that correctly understood?

mdabros avatar Apr 13 '17 11:04 mdabros

Any tree analysis could easily be generated from the row indices of the observations that fall within a node during learning. The column index is already included via the FeatureIndex property of the node; I'm only suggesting adding the row indices as a property as well.
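A minimal sketch of what the proposal might look like on the node type (purely illustrative: apart from FeatureIndex, which the thread confirms exists, the member names below are assumptions and not SharpLearning's actual Node layout):

```csharp
// Illustrative node shape only; not the real SharpLearning Node struct.
public struct AnnotatedNode
{
    public int FeatureIndex;     // column index used for the split (exists today)
    public double Threshold;     // split value (assumed name)
    public int LeftIndex;        // positions of the child nodes in the tree's node array (assumed)
    public int RightIndex;

    // The proposed addition: training-row indices of the observations
    // that reached this node during learning.
    public int[] ObservationIndices;
}
```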

mutigozel avatar Apr 13 '17 11:04 mutigozel

The main purpose of SharpLearning is to provide machine learning algorithms and models for prediction. So decision tree analysis, while very useful, is something I would categorize as a secondary feature in this project, especially if implementing it will increase the size of the models.

However, I am planning a major refactoring of the decision tree based learners and models after I finish my current work on neural nets. If I can find a memory-efficient way to include the extra information required, I will see if I can make this feature request part of the refactoring. So I will leave the feature request open for now and update it when I know more.
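One possible memory-efficient direction, sketched below under assumptions (the SketchNode shape and NodeMembership helper are hypothetical, not the planned refactoring): rather than storing row indices in every node, keep the model as-is and recompute node membership on demand by routing the training rows through the learned tree.

```csharp
using System.Collections.Generic;

// Illustrative node shape; only FeatureIndex is known from the thread,
// the other members and the leaf convention (-1 children) are assumptions.
public struct SketchNode
{
    public int FeatureIndex;   // column used for the split
    public double Threshold;   // split value
    public int LeftIndex;      // child node indices; -1 marks a leaf
    public int RightIndex;
}

public static class NodeMembership
{
    // Returns, for each node index, the training-row indices that pass through it.
    public static Dictionary<int, List<int>> Compute(SketchNode[] nodes, double[][] rows)
    {
        var membership = new Dictionary<int, List<int>>();
        for (int row = 0; row < rows.Length; row++)
        {
            int nodeIndex = 0; // start at the root
            while (true)
            {
                if (!membership.TryGetValue(nodeIndex, out var list))
                    membership[nodeIndex] = list = new List<int>();
                list.Add(row);

                var node = nodes[nodeIndex];
                if (node.LeftIndex == -1 && node.RightIndex == -1) break; // reached a leaf
                nodeIndex = rows[row][node.FeatureIndex] <= node.Threshold
                    ? node.LeftIndex
                    : node.RightIndex;
            }
        }
        return membership;
    }
}
```

This trades some extra computation at analysis time for keeping the serialized model size unchanged.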

mdabros avatar Apr 13 '17 12:04 mdabros

Hello Mads,

I'm heavily into visual coding. I constantly try to figure out the best ways to represent mathematical algorithms visually, making them easily understandable for a non-technical person.

Anyhow, I converted the outcome into the structure I require (parent-child) along with the calculations. Here is how it looks. I hope that will help you with further improvement of the library. I would like to contact you for an online meeting for better communication. Let me know through LinkedIn if you are interested.

Regards,


mutigozel avatar Apr 17 '17 12:04 mutigozel