Some interfaces / dependence discussion

Open lazywei opened this issue 11 years ago • 23 comments

As mentioned in other issues, there are some decisions we need to make.

  • Use biogo.matrix or mat64? Both are under development. mat64 lacks docs, but its author replies to issues very quickly and its memory usage is optimized. The biogo.matrix docs are quite good, but I have no experience using it.
  • Should our pairwise interface return scalar or a vector? Detailed discussion is here: https://github.com/sjwhitworth/golearn/pull/20#discussion_r12289865
  • Interface for data and IO. See https://github.com/sjwhitworth/golearn/blob/io/data/data.go, https://github.com/sjwhitworth/golearn/blob/io/data/string_frame.go This is really essential, because we need to settle on the format/methods/attributes so we can build a trainer/cross_validator/predictor compatible with the data interfaces. Also, I think this should move to the base package, because it is related to many other packages in golearn.
  • How to organize third-party libraries? For example, there is a linear_models/liblinear_src in https://github.com/sjwhitworth/golearn/pull/23. We need to agree on a convention for how to include third-party libraries.
  • Other interfaces?

Please leave comments about the issues above. We should settle these first. @sjwhitworth @ifesdjeen @npbool @marcoseravalli @macmania

lazywei avatar May 06 '14 04:05 lazywei

Should our pairwise interface return scalar or a vector? Detailed discussion is here: #20 (comment)

WRT scalar, I think it may be a good idea, but I'm yet to see an exact use-case. Do we already have any algorithm that requires it?

How to organize third-party libraries? For example, there is a linear_models/liblinear_src in #23. We need to agree on a convention for how to include third-party libraries.

My suggestion (although arguable) would be to try to minimize external dependencies. If there's an absolute necessity to make it, I'd make a separate repo with this particular algorithm and then provide a reference to it. But that's speculative. It'd just be great if we had fewer compile-time and dependency resolution problems.

ifesdjeen avatar May 06 '14 08:05 ifesdjeen

WRT scalar, I think it may be a good idea, but I'm yet to see an exact use-case. Do we already have any algorithm that requires it?

Not yet, I think. Users can achieve the same thing with a for loop or something like map or apply, but we may be able to do some optimized calculation if we implement this ourselves. I have no specific opinion on this one; both (scalar, vector) are fine with me.

My suggestion (although arguable) would be to try to minimize external dependencies. If there's an absolute necessity to make it, I'd make a separate repo with this particular algorithm and then provide a reference to it. But that's speculative. It'd just be great if we had fewer compile-time and dependency resolution problems.

I agree that we should minimize external dependencies and reduce compile-time and dependency resolution problems. However, sometimes we simply need those external libraries. For example, libsvm is the best library for SVMs; almost all SVM implementations in other languages (Python, R, etc.) are built on libsvm. The same goes for liblinear. I'd prefer to just put external libraries in our repo, though. go get is poor at managing dependencies, and putting external libraries in a separate repo may introduce other problems.
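To make the scalar/vector trade-off concrete, here is a minimal sketch, assuming a hypothetical PairwiseMetric interface (the names PairwiseMetric, Euclidean and DistancesTo are illustrative, not golearn's actual API). It shows that a scalar-returning metric is enough for callers to build the vector result themselves with a plain loop:

```go
package main

import (
	"fmt"
	"math"
)

// PairwiseMetric is a hypothetical interface: one pair of vectors in, one scalar out.
type PairwiseMetric interface {
	Distance(a, b []float64) float64
}

// Euclidean implements PairwiseMetric with the L2 distance.
// It assumes a and b have the same length.
type Euclidean struct{}

func (Euclidean) Distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// DistancesTo builds the "vector" result from the scalar interface with a loop.
func DistancesTo(m PairwiseMetric, query []float64, rows [][]float64) []float64 {
	out := make([]float64, len(rows))
	for i, r := range rows {
		out[i] = m.Distance(query, r)
	}
	return out
}

func main() {
	rows := [][]float64{{0, 0}, {3, 4}}
	fmt.Println(DistancesTo(Euclidean{}, []float64{0, 0}, rows)) // [0 5]
}
```

A vector-returning interface would only pay off if the implementation could batch the work more cheaply than this loop does.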

lazywei avatar May 06 '14 08:05 lazywei

I have no opinion on the scalar/vector issue. I'll let you guys decide. Seems like it's working fine as it is right now.

External libraries - if we use them, we have to make sure that they are easily installable across platforms. The numpy/scikit-learn stack in Python is notoriously difficult to install - I don't want that to be the case with our library. It probably makes sense that we include them within the repo, but within a subfolder like ext, to ensure that people don't go digging around in the wrong stuff.

We should move to biogo.matrix. It seems to be the same package, but with much better documentation. If anyone has any problems, please let me know, otherwise we'll migrate.

Dataframes: the only problem with static typing in Go is that we will have either strings or float64s as labels, for categorical/continuous outcomes. How do we propose to solve this for users, without lots of ugly type assertions? Also, why would we use a string_frame? I'm not sure that I see the use case at the moment. The current dataframe looks good to me.

sjwhitworth avatar May 06 '14 19:05 sjwhitworth

:+1: @lazywei should I take over moving to biogo.matrix? If you're already familiar with it, I'd ask to let me do it, if possible.

ifesdjeen avatar May 06 '14 20:05 ifesdjeen

I have no opinion on the scalar/vector issue. I'll let you guys decide. Seems like it's working fine as it is right now.

OK, then let's focus on scalar-only returns at this stage.

External libraries - if we use them, we have to make sure that they are easily installable across platforms. The numpy/scikit-learn stack in Python is notoriously difficult to install - I don't want that to be the case with our library. It probably makes sense that we include them within the repo, but within a subfolder like ext, to ensure that people don't go digging around in the wrong stuff.

Totally agree with you! I hope our library can be installed easily! ext/ sounds good to me. We could provide something like make.go, so users can run go get and then go run make.go to finish installation.

We should move to biogo.matrix. It seems to be the same package, but with much better documentation. If anyone has any problems, please let me know, otherwise we'll migrate.

OK, let's migrate to biogo.matrix

Dataframes: the only problem with static typing in Go is that we will have either strings or float64s as labels, for categorical/continuous outcomes. How do we propose to solve this for users, without lots of ugly type assertions? Also, why would we use a string_frame? I'm not sure that I see the use case at the moment. The current dataframe looks good to me.

I think we can first assume the labels are strings, and then provide a function to convert string labels to float64 labels. In that case, I think a Label struct is necessary; I'll implement it. The reason I'd like to have a StringFrame is that each row in a dataset may have more than one label, e.g.:

12.2, 0.1, 3.4, positive, happy, relax
22.3, 3.1, 1.0, negative, sad, nervous

If that is the case, we can't just use []string to store labels (we need [][]string, which should be wrapped). That being said, I think the previously mentioned Label struct can resolve this problem, but StringFrame will be more general. The question is: do we need StringFrame, or is Label alone enough?
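As an illustration of the multi-label case, here is a minimal sketch using only the standard library. ParseMultiLabel and its nFeatures parameter are hypothetical names for this example, not the ParseCSV function in golearn. It splits each row into float64 features and a []string of labels, which is why the labels end up as [][]string:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// ParseMultiLabel is a hypothetical helper: the first nFeatures columns of each
// row are parsed as float64 features; the remaining columns are kept as labels.
func ParseMultiLabel(data string, nFeatures int) ([][]float64, [][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.TrimLeadingSpace = true // tolerate "a, b, c" style spacing
	records, err := r.ReadAll()
	if err != nil {
		return nil, nil, err
	}
	features := make([][]float64, 0, len(records))
	labels := make([][]string, 0, len(records))
	for _, rec := range records {
		if len(rec) < nFeatures {
			return nil, nil, fmt.Errorf("row has %d columns, want at least %d", len(rec), nFeatures)
		}
		row := make([]float64, nFeatures)
		for i := 0; i < nFeatures; i++ {
			v, err := strconv.ParseFloat(rec[i], 64)
			if err != nil {
				return nil, nil, err
			}
			row[i] = v
		}
		features = append(features, row)
		// Copy the label columns so they do not alias the CSV record.
		labels = append(labels, append([]string(nil), rec[nFeatures:]...))
	}
	return features, labels, nil
}

func main() {
	data := "12.2, 0.1, 3.4, positive, happy, relax\n22.3, 3.1, 1.0, negative, sad, nervous\n"
	x, y, err := ParseMultiLabel(data, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(x)
	fmt.Println(y)
}
```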

@lazywei should I take over moving to biogo.matrix? If you're already familiar with it, I'd ask to let me do it, if possible.

@ifesdjeen OK, thanks for your effort! I'll focus on DataFrame then.

lazywei avatar May 07 '14 04:05 lazywei

@lazywei - can you 'sketch' out an idea of what you'd want the StringFrame to look like, and how it would integrate in a training setting? It can be pseudocode - I'm just having a hard time visualising what you want it to be, at the moment.

@ifesdjeen - thanks for taking on the effort to migrate to biogo! Hopefully it should be as easy as just doing a find and replace ;)

sjwhitworth avatar May 07 '14 06:05 sjwhitworth

np np, will take a closer look at it tonight.

ifesdjeen avatar May 07 '14 07:05 ifesdjeen

@sjwhitworth It could be just simple manipulations, like a string version of a matrix. The idea came up in ParseCSV: I'd like to be able to parse CSVs with multiple labels.

lazywei avatar May 07 '14 07:05 lazywei

But what if you have a dataset that is half floats, half strings? How do you do any learning based off of that?

sjwhitworth avatar May 07 '14 08:05 sjwhitworth

Oh... that's really a problem... Basically, we can train each label separately. Of course, some algorithms need to consider all labels at the same time, but I think that might be out of our scope at the moment.

I think the best way is to force all labels to be numeric. Classification labels can be converted to 0, 1, 2 etc. Regression labels can just be float64. So the question is: should we automatically convert classification labels to numeric values in dataset I/O? How about something like

type Label struct {
    values     *mat64.Dense
    categories map[int](map[int]string)
}

For example,

values = [[0, 1, 3.12], [1, 0, 5.134], ...]
categories = {
0: {0: "happy", 1: "sad"},
1: {0: "positive", 1: "negative"},
2: "regression values"
}

So we can store all labels as numeric values and still know what those values mean (which category, regression or classification, etc.)
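A minimal sketch of how such label encoding could work is below. It is not the eventual golearn implementation: [][]float64 stands in for *mat64.Dense only to keep the example dependency-free, EncodeLabels and Decode are hypothetical names, and only categorical columns are handled (regression columns would be stored as-is).

```go
package main

import "fmt"

// Label is a simplified stand-in for the struct proposed above: values holds one
// row per sample; categories maps column -> numeric code -> original string.
type Label struct {
	values     [][]float64
	categories map[int]map[int]string
}

// EncodeLabels assigns codes 0, 1, 2, ... to the strings of each column in
// order of first appearance.
func EncodeLabels(raw [][]string) Label {
	l := Label{categories: map[int]map[int]string{}}
	codes := map[int]map[string]int{} // column -> string -> code
	for _, row := range raw {
		enc := make([]float64, len(row))
		for col, s := range row {
			if codes[col] == nil {
				codes[col] = map[string]int{}
				l.categories[col] = map[int]string{}
			}
			code, seen := codes[col][s]
			if !seen {
				code = len(codes[col])
				codes[col][s] = code
				l.categories[col][code] = s
			}
			enc[col] = float64(code)
		}
		l.values = append(l.values, enc)
	}
	return l
}

// Decode maps a numeric code in a column back to its original string.
func (l Label) Decode(col int, v float64) string {
	return l.categories[col][int(v)]
}

func main() {
	l := EncodeLabels([][]string{
		{"happy", "positive"},
		{"sad", "negative"},
		{"happy", "negative"},
	})
	fmt.Println(l.values)       // [[0 0] [1 1] [0 1]]
	fmt.Println(l.Decode(1, 1)) // negative
}
```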

lazywei avatar May 07 '14 08:05 lazywei

Sounds good to me. Label encoding built in. Nice. :)

sjwhitworth avatar May 07 '14 09:05 sjwhitworth

OK! Let's GO!

Summary:

  • Use biogo.matrix to replace mat64
  • Metrics package should always return scalar
  • Introduce Label struct, I'll implement this one
  • Put all external dependencies into ext/

Any other suggestions? LGTM

lazywei avatar May 07 '14 09:05 lazywei

Nope! Let's do it!

sjwhitworth avatar May 07 '14 21:05 sjwhitworth

All agreed @ifesdjeen @npbool @marcoseravalli @macmania ?

sjwhitworth avatar May 07 '14 21:05 sjwhitworth

Wow, there's been a lot of activity on this since May 1st! I forked it off with a view to implementing some of the algorithms I struggle with (context: I'm revising for a course in Data Mining). If you're familiar with WEKA (as I am), they have (IMHO) a nice solution to this problem that I've implemented (see instances.go and attributes.go). Instances contains the underlying memory (kept in a go.matrix still) and a slice of Attributes, which impose structure on the data and convert the native float64 format into something meaningful. I've implemented two Attribute types: CategoricalAttribute, which you can use to hold binned values, class values etc., and FloatAttribute, which maps directly to the underlying type. This is all unit-tested and ready to go. Also see the docs.

Advantages

  • Instances provides an abstraction over the underlying representation: you can store it in a plain array, slice, or biogo.matrix, and you can swap it out at any time
  • If you extracted the Instances interface, could also offer specialised formats which are optimised for other scenarios (e.g. binary sparse matrices - hugely important in text mining)
  • Because it's an abstraction, you could also implement things like dropping columns and rows very cheaply

Disadvantages

  • Necessarily creates some overhead (mitigation: move these items into the Instances implementation)
  • Converting everything back and forth into float64 introduces a performance penalty and might be inappropriate for, say, binary types (mitigation: tighter coupling between Instances and attributes, variable-width specifiers, refactoring to use binary strings)
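The Attribute idea described above can be sketched roughly as follows. The type names Instances, CategoricalAttribute and FloatAttribute come from the comment; the method names (ToFloat, ToString, AddRow, Get) and the flat slice storage are assumptions made for this example, not the fork's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// Attribute converts between the float64 storage format and a readable value.
type Attribute interface {
	ToFloat(s string) float64
	ToString(v float64) string
}

// FloatAttribute maps directly onto the underlying float64.
type FloatAttribute struct{}

func (FloatAttribute) ToFloat(s string) float64 {
	v, _ := strconv.ParseFloat(s, 64) // parse errors ignored for brevity
	return v
}

func (FloatAttribute) ToString(v float64) string {
	return strconv.FormatFloat(v, 'g', -1, 64)
}

// CategoricalAttribute stores each distinct string as its index in values.
type CategoricalAttribute struct{ values []string }

func (a *CategoricalAttribute) ToFloat(s string) float64 {
	for i, v := range a.values {
		if v == s {
			return float64(i)
		}
	}
	a.values = append(a.values, s)
	return float64(len(a.values) - 1)
}

func (a *CategoricalAttribute) ToString(v float64) string {
	i := int(v)
	if i < 0 || i >= len(a.values) {
		return ""
	}
	return a.values[i]
}

// Instances pairs row-major float64 storage with one Attribute per column.
type Instances struct {
	attrs []Attribute
	data  []float64
}

// AddRow assumes len(fields) == len(in.attrs).
func (in *Instances) AddRow(fields []string) {
	for i, f := range fields {
		in.data = append(in.data, in.attrs[i].ToFloat(f))
	}
}

// Get returns the readable form of the cell at (row, col).
func (in *Instances) Get(row, col int) string {
	return in.attrs[col].ToString(in.data[row*len(in.attrs)+col])
}

func main() {
	in := &Instances{attrs: []Attribute{FloatAttribute{}, &CategoricalAttribute{}}}
	in.AddRow([]string{"5.1", "Iris-setosa"})
	in.AddRow([]string{"7.0", "Iris-versicolor"})
	fmt.Println(in.Get(0, 0), in.Get(1, 1)) // 5.1 Iris-versicolor
	fmt.Println(in.data)                    // [5.1 0 7 1]
}
```

Every cell costs a full float64 here, which is the overhead listed under the disadvantages.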

All in all, really promising project so far, let's hope I can save @lazywei some work.

Sentimentron avatar May 09 '14 13:05 Sentimentron

@Sentimentron Wow, that's really awesome. I think your Instance is basically the Label I want to implement. I have, however, some concerns:

  • Using Instance for labels is good, but would it be overkill for storing features? I mean, in most cases, features are just numeric values. Of course, sometimes features may be categorical. I don't have much experience training on categorical data, so I'm just wondering whether we really need to deal with those values in our library. That being said, if the cost is cheap (in terms of memory usage, CPU usage etc.), I have no opinion on this :-)
  • The name Instance seems a little ambiguous. In my experience, instances usually refer to training data (more specifically, the training features). However, in your code, it seems that Instance is more general than that. It seems that we can use Instance for both training features and training labels, or even other data structures. Therefore, I think it may be good if we can come up with a more meaningful name. (This is really a minor concern, though)
  • This library has some breaking changes. It would be better if you can rebase your implementation if you want to send a pull request.

@sjwhitworth do you have any suggestions or ideas on this matter?

Anyway, thanks for your effort. I really like your implementation, it saves my life :+1: By the way, a little off-topic: if you are familiar with WEKA, and if you have time, could you help me implement the I/O functions for the ARFF format? Thanks.

lazywei avatar May 09 '14 14:05 lazywei

Let's address those concerns:

  • The Instances type is directly modelled on WEKA. In WEKA, you store everything in Instances, attach Attributes to those Instances and then designate one of those as the "class variable", which can be categorical (e.g. "Iris-setosa" etc) or numeric (e.g. if you were implementing a regression-type system of the type I'm not too familiar with). Similarly, this version also allows you to store both inside a single type. That said, it's not cheap (yet), because even if you only have 4 categorical labels, each one still takes up 8 bytes of room as a value in the matrix.
  • Instances is very ambiguous and it's probably what you would name a base Interface covering lots of diverse implementations, so you'd probably retain the name Instances for generality and implement DenseInstances, SparseInstances, BinaryInstances etc that are optimised for certain use cases.
  • There will be some churn attached to this, but I bore that in mind and based on my experience with refactoring the KNN implementation I think it would be possible to migrate most of the new code quite quickly.
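The "base interface with specialised implementations" idea can be sketched as below. The names Instances, DenseInstances and SparseInstances are from the comment above; the method set (Rows, Cols, Get, Set) and the ColumnSum helper are assumptions chosen for illustration.

```go
package main

import "fmt"

// Instances is a hypothetical base interface; learners depend only on it.
type Instances interface {
	Rows() int
	Cols() int
	Get(row, col int) float64
	Set(row, col int, v float64)
}

// DenseInstances stores every cell in a row-major slice (8 bytes per value).
type DenseInstances struct {
	rows, cols int
	data       []float64
}

func NewDenseInstances(rows, cols int) *DenseInstances {
	return &DenseInstances{rows: rows, cols: cols, data: make([]float64, rows*cols)}
}

func (d *DenseInstances) Rows() int                  { return d.rows }
func (d *DenseInstances) Cols() int                  { return d.cols }
func (d *DenseInstances) Get(r, c int) float64       { return d.data[r*d.cols+c] }
func (d *DenseInstances) Set(r, c int, v float64)    { d.data[r*d.cols+c] = v }

// SparseInstances stores only non-zero cells, keyed by (row, col).
type SparseInstances struct {
	rows, cols int
	cells      map[[2]int]float64
}

func NewSparseInstances(rows, cols int) *SparseInstances {
	return &SparseInstances{rows: rows, cols: cols, cells: map[[2]int]float64{}}
}

func (s *SparseInstances) Rows() int            { return s.rows }
func (s *SparseInstances) Cols() int            { return s.cols }
func (s *SparseInstances) Get(r, c int) float64 { return s.cells[[2]int{r, c}] }
func (s *SparseInstances) Set(r, c int, v float64) {
	if v == 0 {
		delete(s.cells, [2]int{r, c}) // zeros are implicit
		return
	}
	s.cells[[2]int{r, c}] = v
}

// ColumnSum works with any implementation of Instances.
func ColumnSum(in Instances, col int) float64 {
	sum := 0.0
	for r := 0; r < in.Rows(); r++ {
		sum += in.Get(r, col)
	}
	return sum
}

func main() {
	for _, in := range []Instances{NewDenseInstances(3, 2), NewSparseInstances(3, 2)} {
		in.Set(0, 1, 2.5)
		in.Set(2, 1, 1.5)
		fmt.Println(ColumnSum(in, 1)) // 4 for both implementations
	}
}
```

Because ColumnSum only sees the interface, swapping the storage does not touch algorithm code, which is the main advantage claimed for the abstraction.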

And edit: I'm also familiar with the ARFF format. It's super-simple and essentially CSV apart from the header, which specifies the types for each attribute explicitly. Because I already revised the CSV importer quite a lot to use the new Instances type, about 90% of the code needed to support ARFF already exists and I have lots of ARFF files lying around (in various states of validity) for unit-testing.

Sentimentron avatar May 09 '14 14:05 Sentimentron

Hey guys sorry for being gone for quite some time (we had a major release, so I completely failed to keep up with OSS schedule), back on track now, gotta take care of matrix migration, hope it's still relevant.

Glad to still see some discussions and activity here.

ifesdjeen avatar May 10 '14 21:05 ifesdjeen

I had some concerns about using biogo.matrix.

In particular, it provides no support for eigen or singular value decomposition, which are important for a plethora of dimensionality reduction problems. Gonum's mat64 package, on the other hand, supports both. Additionally, the goals of the biogo.matrix library seem to be primarily to act as a supplement to the biogo bioinformatics project. I don't foresee the library evolving to include the flexibility that a linear algebra library such as mat64 would provide. However, I'm pretty certain that such flexibility will be beneficial for our project.

I understand that mat64 is somewhat lacking in adequate documentation, but in light of the features that biogo.matrix lacks perhaps we could rethink the migration.

Any insights on this?

hpxro7 avatar Jun 05 '14 00:06 hpxro7

@hpxro7 I have no idea how much work would be needed if we choose to roll back to mat64. On the other hand, do you think it is possible to implement those eigen computations ourselves? If so, I think we can work on this together, while others can focus on ML algorithms.

If basic sparse/dense matrix arithmetic is already there, I think it won't be too hard to implement something like Arnoldi iteration?

lazywei avatar Jun 05 '14 03:06 lazywei

If we were to roll back to mat64, that's not a problem for me: I'd just have to revert the code which allocates and accesses the matrix.

Sentimentron avatar Jun 05 '14 08:06 Sentimentron

Looking at my fork off master, it seems like all of the matrix related code sits atop mat64. I couldn't find any references to biogo. I'm assuming then that most of the code written in biogo has yet to be pulled into master?

@lazywei I think that might be a cool idea, but I'd fathom that rolling back any biogo.matrix instances to mat64 would be far less challenging compared to re-implementing these somewhat involved linear algebra algorithms. My opinion is that since there is already a decent implementation of a matrix library, we could stick to using that instead of replicating what has already been done.

hpxro7 avatar Jun 06 '14 01:06 hpxro7

OK, I think switching back to gonum is an acceptable choice. If all of you think it is a good idea, then let's do it. I can work on the docs. Also, I think it would be a good idea to stick to gonum's other packages: https://github.com/gonum They may help, and I think they can save us a lot of time. :+1:

lazywei avatar Jun 06 '14 05:06 lazywei