Ben Sadeghi

Results 40 comments of Ben Sadeghi

@Eight1911 It's worth a try to see how the above approach affects how the pruning exercise "feels". You can test it out using the [Iris pruning runs](https://github.com/bensadeghi/DecisionTree.jl/blob/9a6d9e53e6a82a307d36ef2feff4d52db93b997c/test/classification/iris.jl#L26). The current...
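To get a feel for what post-pruning by purity does, here is a minimal sketch of the idea, using toy `Leaf`/`Node` structs of our own rather than DecisionTree.jl's actual types: sibling leaves are merged bottom-up whenever the merged leaf's purity clears the threshold.

```julia
# Toy tree types for illustration only (not the DecisionTree.jl types).
struct Leaf
    counts::Dict{String,Int}   # class label => count of training samples at this leaf
end

struct Node
    featid::Int
    thresh::Float64
    left::Union{Leaf,Node}
    right::Union{Leaf,Node}
end

# Purity = fraction of samples belonging to the majority class.
purity(l::Leaf) = maximum(values(l.counts)) / sum(values(l.counts))

merge_leaves(a::Leaf, b::Leaf) = Leaf(mergewith(+, a.counts, b.counts))

# Bottom-up pruning: collapse a node into a single leaf when the merged
# leaf's purity meets the threshold (the purity_thresh idea).
prune(l::Leaf, thresh) = l
function prune(n::Node, thresh)
    left, right = prune(n.left, thresh), prune(n.right, thresh)
    if left isa Leaf && right isa Leaf
        merged = merge_leaves(left, right)
        purity(merged) >= thresh && return merged
    end
    return Node(n.featid, n.thresh, left, right)
end
```

With a 9-vs-1 split, a threshold of 0.9 collapses the node into one leaf, while 0.95 leaves the split in place.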

@Eight1911 I don't think we need another pre-pruning criterion. Note that scikit-learn is [deprecating](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) `min_impurity_split` in favor of `min_impurity_decrease`. Personally, I'm quite comfortable with the current implementation of `prune_tree` for...
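For reference, the difference between the two criteria is where the check happens: `min_impurity_split` thresholds the parent's impurity, while `min_impurity_decrease` thresholds how much a split actually reduces it. A hedged sketch of the decrease check (names and simplified weighting are ours, not scikit-learn's exact formula):

```julia
# Shannon entropy of a vector of class proportions.
entropy(p) = -sum(x -> x == 0 ? 0.0 : x * log2(x), p)

# Weighted impurity decrease of splitting a parent (class proportions)
# into left/right children holding nl and nr samples respectively.
function impurity_decrease(parent, left, right, nl, nr)
    n = nl + nr
    entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)
end

# The split is kept only if the decrease clears the threshold.
accept_split(dec, min_dec) = dec >= min_dec
```

Splitting a 50/50 parent into two pure children yields the maximal decrease of 1.0 bit, so it passes any reasonable threshold.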

Yeah, `build_adaboost_stumps` has always had issues, and it uses a different optimization technique than `build_tree`, one that is quite slow. Not sure what to do here; it requires significant work....

Yeah, this is an issue. Back to your example, note that the tree generated is actually a leaf, and so there is no decision to be made based on input...

I'm still hesitant to add a new field to the `Node` type. If this issue is handled in SKL.jl, then it's ok. And yes, the bloated models need to be...

The split routines already identify which features have the most predictive power (information gain) via Shannon entropy. So IMO, manually identifying/defining which features are of high importance is unnecessary, and...
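To make that concrete, here is a toy sketch of the quantity the split routines maximize: the information gain (entropy reduction) of splitting on a feature at a given threshold. This is illustration-only code, not the DecisionTree.jl internals.

```julia
# Shannon entropy of a label vector.
function entropy(labels)
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    n = length(labels)
    -sum(c -> (c / n) * log2(c / n), values(counts))
end

# Information gain of splitting feature column x at threshold t,
# given label vector y.
function info_gain(x, y, t)
    mask = x .< t
    yl, yr = y[mask], y[.!mask]
    (isempty(yl) || isempty(yr)) && return 0.0
    n = length(y)
    entropy(y) - length(yl) / n * entropy(yl) - length(yr) / n * entropy(yr)
end
```

A feature that cleanly separates the classes scores the full gain (1.0 bit for two balanced classes), which is exactly why manual importance tagging is redundant.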

Works fine on the DT.jl side. The issue might be with MLJ.jl, potentially need to overload isless().

```julia
using Random, DecisionTree
features, labels = load_data("adult")
# Note that the data...
```

@ablaom Yes, lexicographical order is used for the splitting criteria, where subsets of the features are sorted before being searched through for the best split (via information gain). I'm not...
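The sort-then-scan search can be sketched as follows: the feature's distinct values are sorted, and midpoints between adjacent values are tried as candidate thresholds, keeping whichever scores best. The function and names here are illustrative, not the actual DT.jl internals; `gainfn` stands in for the information-gain evaluation.

```julia
# Find the best split threshold for one feature column by scanning
# midpoints of the sorted distinct values.
function best_threshold(x, gainfn)
    xs = sort(unique(x))          # sorted (lexicographic/numeric) order
    best_t, best_g = NaN, -Inf
    for i in 1:length(xs)-1
        t = (xs[i] + xs[i+1]) / 2  # candidate split point
        g = gainfn(t)
        if g > best_g
            best_t, best_g = t, g
        end
    end
    return best_t, best_g
end
```

Sorting first is what makes the scan linear in the number of distinct values; it is also why a well-defined ordering (hence `isless`) on the feature's element type matters.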

Thanks @ablaom. I've updated the readme with your input.

You could cast the features to a concrete type (i.e. `X = Int.(X)`) as opposed to using the `Any` type, which is quite heavy. That should help a little bit....
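As a quick illustration of the cast (toy data, just to show the element-type change):

```julia
# A Matrix{Any} stores each cell as a boxed pointer, which is slow and heavy.
X_any = Any[1 2; 3 4]

# Broadcasting the constructor narrows it to a concrete, contiguous Matrix{Int}.
X = Int.(X_any)
```

With a concrete element type the compiler can specialize the split loops, so this is usually an easy win before reaching for anything fancier.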