CloudForest
CloudForest copied to clipboard
applyforest not happy with gradient boosting via -gbt
Seems like the -gbt gradient boosting might need a little love. I could get it to grow, but applyforest was unhappy (crashed; see below).
## first without -gbt, works:
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ growforest -train data/iris.data.fm -rfpred forest.sf -target C:Class -mTry 2 -nTrees 20 -leafSize 1
Threads : 1
nTrees : 20
Loading data from: data/iris.data.fm
Target : C:Class
Non Target Features : 4
mTry : 2
non-missing cases: 150
leafSize : 1
nSamples : 150
Performing classification with 3 categories.
Total training time (seconds): 0.008251627
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ applyforest -fm ./data/iris.data.fm -rfpred forest.sf -preds pred.tsv
Target is C:Class in feature 4
Error: 0.00666666666666671
Outputting label predicted actual tsv to pred.tsv
## now with -gbt, growforest works, but applyforest crashes:
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ growforest -train data/iris.data.fm -rfpred forest.sf -target C:Class -mTry 2 -nTrees 20 -leafSize 1 -gbt 0.0001
Threads : 1
nTrees : 20
Loading data from: data/iris.data.fm
Target : C:Class
Non Target Features : 4
mTry : 2
non-missing cases: 150
leafSize : 1
nSamples : 150
Performing classification with 3 categories.
Using Gradient Boosting Classification with postive class: True
Total training time (seconds): 0.002043676
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ applyforest -fm ./data/iris.data.fm -rfpred forest.sf -preds pred.tsv
Target is C:Class in feature 4
Error: 1
Outputting label predicted actual tsv to pred.tsv
panic: interface conversion: CloudForest.VoteTallyer is *CloudForest.CatBallotBox, not *CloudForest.NumBallotBox
goroutine 1 [running]:
main.main()
/Users/jaten/go/src/github.com/lytics/CloudForest/applyforest/applyforest.go:96 +0x14af
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ git log|head
commit 1af2b08f235b69b4393276f293055e1d89d1a555
Merge: 0e79e70 7075fe5
Author: Tim Kaye <[email protected]>
Date: Wed Nov 21 10:24:03 2018 -0800
Merge pull request #18 from lytics/determinstic_ballotbox
Ensure that predictions/tallies are always deterministic
commit 7075fe572867b7e6ac7ea9fc4fea4f9f603c6025
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ go version
go version go1.10.2 darwin/amd64
I get all NaN predictions if I try using the -sum
flag.
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ applyforest -fm ./data/iris.data.fm -rfpred forest.sf -preds pred.tsv -sum
Target is C:Class in feature 4
Error: 1
Outputting label predicted actual tsv to pred.tsv
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $ head pred.tsv
1 NaN Iris-setosa
2 NaN Iris-setosa
3 NaN Iris-setosa
4 NaN Iris-setosa
5 NaN Iris-setosa
6 NaN Iris-setosa
7 NaN Iris-setosa
8 NaN Iris-setosa
9 NaN Iris-setosa
10 NaN Iris-setosa
jaten@jatens-MacBook-Pro ~/go/src/github.com/lytics/CloudForest (master) $
To avoid the panic
, I think we need to do something like this on linke 96 (where the panic happens):
https://github.com/lytics/CloudForest/blob/1af2b08f235b69b4393276f293055e1d89d1a555/applyforest/applyforest.go#L59-L69
I toyed with that a little.
Could we pop out to the bigger picture, for a moment?
Does the requirement to have the user need to specify the type of ballot box depending on the type of target used seem a little odd to you?
What I mean is, I would usually expect to call a Predict()
method, and the model -- which already knows what target was used during building -- would "do the right thing" -- would behave properly (polymorphically) for the type of Forest or ForestModel at hand.
Then the -mean
and -sum
flags would be rendered unnecessary.
Does the requirement to have the user need to specify the type of ballot box depending on the type of target used seem a little odd to you?
Yeah I think that is unusual. Having the ballot box type
be inferred sounds like a better approach. However, there isn't really a good way of determining the "Model type" from a Forest (ie there are no flags/fields on the forest struct that describe it)
https://github.com/lytics/CloudForest/blob/1af2b08f235b69b4393276f293055e1d89d1a555/forest.go#L17-L22
So one possible solution would be to add a ModelType
field (or something analogous) to the forest. This would eliminate the need for the -mean
and -sum
flags.
maybe something like this?
func predict(fm *FeatureMatrix, f *Forest) VoteTallyer {
data := fm.Data[fm.Map[f.Target]]
n := data.Length()
var tallyBox VoteTallyer
switch data.(type) {
case *DenseNumFeature:
tallyBox = NewNumBallotBox(n)
case *DenseCatFeature:
tallyBox = NewCatBallotBox(n)
default:
if f.Type == "gbm" {
tallyBox = NewSumBallotBox(n)
}
return nil
}
}
That all makes sense. Would not ideally a Predict() method would return not a VoteTallyer but the the set of predictions (Yhat(s)) corresponding to the fm data input?
func Predict(fm *FeatureMatrix, f *Forest) (yhat []float64) { ... }
The only tricky part is for classifiers rather than regressions. Then yhat is used to being []string, a slice of category strings predicted by the classifier, rather than []float64. But perhaps a map from category to encoding in whole numbers could be included as a part of the feature matrix... oh, wait, it is already there, as the Map map[string]int
, I believe.
type FeatureMatrix struct {
Data []Feature
Map map[string]int
CaseLabels []string
}
So would returning []float64 work for all types of trees, do you think?
predict
was a poor choice for the function name above. Instead something like
func getTallyerForForest(fm *FeatureMatrix, f *Forest) VoteTallyer {
...
}
could be better.
If you wanted to have a Predict
function that supports all model-types, it might make sense to return []string
instead of a []float64
, since all the ballot-boxes implement Tally(i int) (predicted string)
so something like:
func PredictStr(fm *FeatureMatrix, f *Forest) []string {
tallyer := getTallyerForForest(fm, f)
for _, tree := range f.Trees {
tree.Vote(fm, tallyer)
}
var preds []string
for i :=0; i < fm.Data[0].Length(); i++ {
preds = append(preds, tallyer.Tally(i))
}
return preds
}
well, regressors will need the float64, and parsing strings into float64 is slow/expensive.
I like your function. How about a simple "union" approach to the return value.
type Predictions struct {
// only one of Real or Cat will be populated
Real []float64
Cat []string
IsReal bool // true for regression Mean and gradient boosted Sum talliers; else Cat is used for classifier output
}
then Predict returns the right field populated, the other with length 0.
func Predict(fm *FeatureMatrix, f *Forest) (*Predictions, error) {
tallyer := getTallyerForForest(fm, f)
for _, tree := range f.Trees {
tree.Vote(fm, tallyer)
}
if isCat {
var preds []string
for i :=0; i < fm.Data[0].Length(); i++ {
preds = append(preds, tallyer.Tally(i))
}
return &Predictions{Cat:preds}, nil
}
// the logic for the regression and gradient boosting goes here
...
return &Predictions{Real: preds, IsCat:false}, nil
}
incomplete, but this pull requests gives an idea of what I was thinking:
https://github.com/lytics/CloudForest/pull/25