CloudForest

What file formats should be supported for data and models?

Open ryanbressler opened this issue 11 years ago • 18 comments

ryanbressler avatar Feb 04 '14 19:02 ryanbressler

Libsvm file format has been requested here:

https://github.com/ryanbressler/CloudForest/issues/31

ryanbressler avatar Feb 24 '14 16:02 ryanbressler

ARFF and possibly unlabeled CSV, as commonly used by machine learning repos

ryanbressler avatar Mar 01 '14 21:03 ryanbressler

Basic ARFF support is in. CSV is supported now too, but only when you use CloudForest as a library, since you need to define the feature types.

Wondering if sparse arff and libsvm should be included and if a sparse feature representation is needed to do them well.

ryanbressler avatar Mar 02 '14 00:03 ryanbressler

maybe C4.5:

http://www.cs.washington.edu/dm/vfml/appendixes/c45.htm

ryanbressler avatar Mar 02 '14 04:03 ryanbressler

basic libsvm support is in

ryanbressler avatar Mar 03 '14 20:03 ryanbressler

How can I grow a cloudRF forest from a libsvm file? (I don't know which target to declare.) E.g.: ~/cloudRF/growforest -train usps.libsvm -rfpred usps.sf -target ??? -nTrees 1000, where usps.libsvm is the training data file.

tungntdhtl avatar Apr 14 '14 15:04 tungntdhtl

-target 0 should do it, since the target is in the first column and there aren't column names

ryanbressler avatar Apr 14 '14 15:04 ryanbressler

I received some errors, as below:

~/cloudRF/growforest -train usps -rfpred usps.sf -target 0 -nTrees 500
Threads : 1
nTrees : 500
Loading data from: usps
panic: runtime error: index out of range

goroutine 1 [running]:
runtime.panic(0x8186020, 0x836d037)
/usr/local/go/src/pkg/runtime/panic.c:266 +0xac
github.com/ryanbressler/CloudForest.ParseAFM(0xb772bab8, 0x18600468, 0x836fd50)
/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/featurematrix.go:294 +0xccd
github.com/ryanbressler/CloudForest.LoadAFM(0xbff7f407, 0x4, 0x0, 0x0, 0x0)
/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/featurematrix.go:367 +0x2d4
main.main()
/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/growforest/growforest.go:168 +0x1009

tungntdhtl avatar Apr 14 '14 15:04 tungntdhtl

You need to rename usps to usps.libsvm so that growforest knows how to parse it.
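
Editor's sketch: the stack trace above shows the extensionless file going through LoadAFM into ParseAFM, which is consistent with the parser being chosen from the file name. The Go snippet below illustrates that kind of suffix-based dispatch. The function name formatFor and the ".arff" suffix are assumptions made for illustration, not CloudForest's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// formatFor guesses the input format from the file name suffix.
// Hypothetical helper for illustration only; not part of CloudForest.
func formatFor(filename string) string {
	switch {
	case strings.HasSuffix(filename, ".libsvm"):
		return "libsvm"
	case strings.HasSuffix(filename, ".arff"): // assumed suffix
		return "arff"
	default:
		// Anything unrecognized is treated as an annotated feature matrix,
		// which is why a libsvm file named "usps" fails to parse.
		return "afm"
	}
}

func main() {
	for _, name := range []string{"usps", "usps.libsvm", "iris.arff"} {
		fmt.Printf("%s -> %s\n", name, formatFor(name))
	}
}
```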

ryanbressler avatar Apr 14 '14 16:04 ryanbressler

Also do an update if you haven't as I recently fixed some small bugs with libsvm support.

ryanbressler avatar Apr 14 '14 16:04 ryanbressler

Great! It is running. You should write some documentation about this for CloudRF's users :) Thanks Ryan!

tungntdhtl avatar Apr 14 '14 16:04 tungntdhtl

ryanbressler commented "-target 0 should do it, since the target is in the first column and there aren't column names". How does CloudRF recognize the data type of the target response (B:, N:, or C:)?

tungntdhtl avatar Apr 15 '14 03:04 tungntdhtl

It checks to see if the first entry is an int or a float. Ints are handled as C or B, floats as N. If you want regression and the first entry is an int, just make sure it is written with a decimal point (i.e. 0.0, not 0).
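
Editor's sketch: the int-versus-float check described here can be illustrated in Go roughly as follows. The function name inferTargetKind and its return strings are hypothetical; CloudForest's real implementation may differ in detail.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferTargetKind classifies a target column from its first entry.
// An entry that parses as an integer is treated as categorical;
// one that only parses as a float is treated as numerical.
// Hypothetical helper for illustration only.
func inferTargetKind(first string) (string, error) {
	if _, err := strconv.ParseInt(first, 10, 64); err == nil {
		return "categorical", nil
	}
	if _, err := strconv.ParseFloat(first, 64); err == nil {
		return "numerical", nil
	}
	return "", fmt.Errorf("entry %q is neither an int nor a float", first)
}

func main() {
	for _, entry := range []string{"0", "0.0", "3", "2.5"} {
		kind, err := inferTargetKind(entry)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %s\n", entry, kind)
	}
}
```

Under this scheme, writing a label as 0.0 rather than 0 is what switches the target from classification to regression.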

ryanbressler avatar Apr 15 '14 04:04 ryanbressler

OK, thanks! That is a good approach. It can also read sparse libsvm format files, right? I.e. each record is the label Yi followed by the Xi values written as col:value pairs, e.g. for data with 100 features: 3 1:1 5:2.5 16:8 19:0.4 50:-1.2 55:1 72:4 85:6 90:3.2 98:3.8 100:6.2

tungntdhtl avatar Apr 15 '14 05:04 tungntdhtl

Yes, all unspecified features will be assumed to be zero.
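
Editor's sketch: filling unspecified features with zero can be illustrated by expanding one sparse record into a dense slice. This is a minimal Go example using a shortened version of the record above. The function name parseSparseLine is hypothetical, and the code assumes 1-based feature indices as in the libsvm format; it is not CloudForest's parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSparseLine splits "label idx:val idx:val ..." into the label and a
// dense feature slice of length nFeatures. Features that do not appear in
// the line keep Go's zero value, 0.0. Hypothetical helper for illustration.
func parseSparseLine(line string, nFeatures int) (string, []float64, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return "", nil, fmt.Errorf("empty line")
	}
	dense := make([]float64, nFeatures) // zero-initialized
	for _, field := range fields[1:] {
		parts := strings.SplitN(field, ":", 2)
		if len(parts) != 2 {
			return "", nil, fmt.Errorf("bad pair %q", field)
		}
		idx, err := strconv.Atoi(parts[0])
		if err != nil || idx < 1 || idx > nFeatures {
			return "", nil, fmt.Errorf("bad feature index in %q", field)
		}
		val, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			return "", nil, fmt.Errorf("bad value in %q", field)
		}
		dense[idx-1] = val // libsvm indices start at 1
	}
	return fields[0], dense, nil
}

func main() {
	label, features, err := parseSparseLine("3 1:1 5:2.5 8:-1.2", 10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(label, features)
}
```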

ryanbressler avatar Apr 15 '14 05:04 ryanbressler

With a LIBSVM file containing a lot of records (e.g. 60,000,000), how can I build trees in cloudRF?

I tried limiting each tree to a portion of the total records with the "nSamples=0.1" option. Does that mean cloudRF works on only 10% of the total samples? If so, how can I take bootstrap samples from all records at that proportion, i.e. each tree grows from 10% of the total records, and each 10% is a random sample drawn from all records?

tungntdhtl avatar Apr 15 '14 05:04 tungntdhtl

Random forest bags samples independently for each tree so I think it is already doing what you are asking for.

ryanbressler avatar Apr 15 '14 06:04 ryanbressler

I mean that RF struggles to build trees from a large sample size because the trees themselves become large. In cloudRF, RF can grow each tree from a portion of the total records. My question is: what is the scope of that portion? Does each tree use all of the bagged records, or only a small, independently drawn subset (e.g. 10%)?

tungntdhtl avatar Apr 15 '14 06:04 tungntdhtl