What file formats should be supported for data and models?
Libsvm file format has been requested here:
https://github.com/ryanbressler/CloudForest/issues/31
ARFF and possibly unlabeled CSV, as commonly used by machine learning repos.
Basic ARFF support is in, and CSV is supported now, but only if you use CloudForest as a library, since you need to define the feature types.
Wondering whether sparse ARFF and libsvm should be included, and whether a sparse feature representation is needed to support them well.
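To make the sparse-representation question concrete, here is one minimal sketch of what such a representation could look like. This is purely illustrative (nothing like it exists in CloudForest today): only non-zero entries are stored, and absent indices read back as zero, matching libsvm semantics.

```go
package main

import "fmt"

// SparseFeature is one possible sparse representation: only non-zero
// entries are stored in a map, and absent indices read back as zero.
// A hypothetical sketch, not an actual CloudForest type.
type SparseFeature struct {
	Values map[int]float64
}

// Get returns the value at index i; missing keys yield Go's zero value.
func (s SparseFeature) Get(i int) float64 {
	return s.Values[i]
}

func main() {
	f := SparseFeature{Values: map[int]float64{0: 1.5, 7: -2.0}}
	fmt.Println(f.Get(0), f.Get(3), f.Get(7)) // 1.5 0 -2
}
```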
maybe C4.5:
http://www.cs.washington.edu/dm/vfml/appendixes/c45.htm
basic libsvm support is in
How can I grow a cloudRF with a libsvm file? (I don't know which target to declare.) e.g.: ~/cloudRF/growforest -train usps.libsvm -rfpred usps.sf -target ??? -nTrees 1000, where usps.libsvm is a training data file.
-target 0 should do it, since the target is in the first column and there aren't column names.
I received some errors as below:

```
~/cloudRF/growforest -train usps -rfpred usps.sf -target 0 -nTrees 500
Threads : 1
nTrees : 500
Loading data from: usps
panic: runtime error: index out of range

goroutine 1 [running]:
runtime.panic(0x8186020, 0x836d037)
	/usr/local/go/src/pkg/runtime/panic.c:266 +0xac
github.com/ryanbressler/CloudForest.ParseAFM(0xb772bab8, 0x18600468, 0x836fd50)
	/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/featurematrix.go:294 +0xccd
github.com/ryanbressler/CloudForest.LoadAFM(0xbff7f407, 0x4, 0x0, 0x0, 0x0)
	/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/featurematrix.go:367 +0x2d4
main.main()
	/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/growforest/growforest.go:168 +0x1009
```
You need to rename usps to usps.libsvm so that growforest knows how to parse it.
Also, do an update if you haven't, as I recently fixed some small bugs with libsvm support.
Great! It is running. You should write some comments about this for CloudRF's users :) Thanks Ryan!
ryanbressler commented "-target 0 should do it since the target is in the first column and there aren't column names". How does cloudRF recognize the data type of the target response? (B:, N:, or C:)
It checks to see if the first entry is an int or a float. Ints are handled as C or B, floats as N. If you want regression and the first entry is an int, just make sure it is written with a decimal point (i.e. 0.0, not 0).
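The heuristic above can be sketched in a few lines of Go. This is an illustrative stand-in (the function name and return labels are made up, not CloudForest's actual parsing code), showing why "0" is read as categorical while "0.0" is read as numerical:

```go
package main

import (
	"fmt"
	"strconv"
)

// guessTargetType sketches the described heuristic: a value that parses
// as an int is treated as categorical (C or B); one that only parses as
// a float is treated as numerical (N). Hypothetical, for illustration.
func guessTargetType(first string) string {
	if _, err := strconv.Atoi(first); err == nil {
		return "categorical" // C or B
	}
	if _, err := strconv.ParseFloat(first, 64); err == nil {
		return "numerical" // N
	}
	return "unknown"
}

func main() {
	fmt.Println(guessTargetType("0"))   // categorical
	fmt.Println(guessTargetType("0.0")) // numerical
}
```

So writing the first target entry as 0.0 flips the guess from classification to regression.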
OK, thanks! That is a good way. It can also read sparse libsvm format files, right? i.e. features represented as col:value pairs, e.g. data with 100 features: 3 1:1 5:2.5 16:8 19:0.4 50:-1.2 55:1 72:4 85:6 90:3.2 98:3.8 100:6.2
Yes, all unspecified features will be assumed to be zero.
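For anyone curious how the sparse-to-dense expansion works, here is a small sketch of parsing one libsvm-style line into a dense vector. It is illustrative only (the function is hypothetical, not CloudForest's actual parser), but it shows the key point: the vector is zero-initialized, so unspecified columns stay zero.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSparseLine expands one libsvm-style line ("label col:val col:val ...")
// into a label and a dense feature vector of length n. Unspecified
// features are left at their zero value. Hypothetical sketch only;
// error handling is omitted for brevity.
func parseSparseLine(line string, n int) (string, []float64) {
	fields := strings.Fields(line)
	label := fields[0]
	features := make([]float64, n) // zero-initialized
	for _, f := range fields[1:] {
		parts := strings.SplitN(f, ":", 2)
		col, _ := strconv.Atoi(parts[0])
		val, _ := strconv.ParseFloat(parts[1], 64)
		features[col-1] = val // libsvm column indices are 1-based
	}
	return label, features
}

func main() {
	label, fs := parseSparseLine("3 1:1 5:2.5 16:8", 20)
	fmt.Println(label, fs[0], fs[4], fs[15], fs[1]) // 3 1 2.5 8 0
}
```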
With a LIBSVM file containing lots of records (e.g. 60,000,000), how can I build trees in cloudRF?
I tried setting a portion of the total records using the "nSamples=0.1" option; does that mean cloudRF works with only 10% of the total sample? If yes, how can I take bootstrap samples of the total records at that portion? i.e. each tree grows from 10% of the total records, with each 10% randomly sampled from the total records.
Random forest bags samples independently for each tree, so I think it is already doing what you are asking for.
I mean that RF struggles to build trees from a large sample size because each tree becomes large. In cloudRF, trees can grow from a portion of the total records. My question is: what is the scope of that portion? Does each tree use all bagged records, or only a small independent subset (e.g. 10%)?