
Other classification/nlp tools

Open Ch4s3 opened this issue 7 years ago • 13 comments

We already do a bag of words, and word counts. Would it be useful to anyone to expose this functionality for other classification uses?

Some other things to consider:

  • [ ] N-grams
  • [ ] Levenshtein distance
  • [ ] Sentiment analysis
  • [ ] term frequency–inverse document frequency
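For context, here is a minimal plain-Ruby sketch of the first two items on the list. The tokenization is deliberately naive and the method names are illustrative only, not classifier-reborn's internals:

```ruby
# Naive bag-of-words and n-gram helpers. classifier-reborn's Hasher
# additionally does stemming and stopword removal on top of this idea.

def bag_of_words(text)
  # Count each lowercased word token
  text.downcase.scan(/[a-z']+/).tally
end

def ngrams(text, n)
  # Slide a window of n tokens across the text
  tokens = text.downcase.scan(/[a-z']+/)
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

bag_of_words("the cat sat on the mat")
# => {"the"=>2, "cat"=>1, "sat"=>1, "on"=>1, "mat"=>1}
ngrams("the cat sat", 2)
# => ["the cat", "cat sat"]
```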

Ch4s3 avatar Jan 02 '17 06:01 Ch4s3

@tra38 could you elaborate on which part(s) you're interested in?

Ch4s3 avatar Jan 06 '17 04:01 Ch4s3

Sure. Since classifier-reborn already collects a bunch of data, it makes sense to publicly expose the data that "classifier-reborn" gathered, so that a programmer can then feed that data into other gems that handle different classification/NLP tasks.

For example, I've been clustering articles using classifier-reborn and kmeans-clusterer with the following code snippet:

require 'classifier-reborn'
require 'kmeans-clusterer'

lsi = ClassifierReborn::LSI.new

strings = ["example string a", "example string b", "example string c"]

strings.each do |x|
  lsi.add_item(x)
end

# Reach into private state to grab the transformed ClassifierReborn
# content nodes (a Hash keyed by the original item)
string_data = lsi.instance_variable_get(:@items)

# Process the information for use in kmeans-clusterer
data = strings.map do |string|
  string_data[string].lsi_norm.to_a
end

clusters = 13
kmeans = KMeansClusterer.run clusters, data, labels: strings, runs: 10

And obviously, it's kinda hacky to dig out the lsi_norm of each individual content node just so you can do some k-means clustering, which is why I gave you a "thumbs up" for considering exposing this data more directly. (And if I'm using some aspect of classifier-reborn strangely here, then some other programmer will use bags of words and word counts strangely as well. Expose all the data, trust the programmer.)

tra38 avatar Jan 06 '17 04:01 tra38

@tra38 I think we could expose the lsi data. It'll probably take some careful refactoring, but should be doable.

Ch4s3 avatar Jan 06 '17 06:01 Ch4s3

Will there be multi-label classification? For example: given an input, classify it into more than one category.

Looooong avatar Jan 11 '17 10:01 Looooong

@Looooong not with Bayes; that's not really how it works.

Ch4s3 avatar Jan 11 '17 13:01 Ch4s3

@Looooong: Will there be multi-label classification? For example: given an input, classify it into more than one category.

You can get the raw score of each category against a given text in Bayes. This way you can decide to get top-K relevant categories, if that is what you are after.
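To sketch the top-K idea: Bayes can give you a hash of category-to-raw-score, from which you take the K best. The scores below are made-up stand-ins for what the classifier would return against some text:

```ruby
# Stand-in for the raw per-category scores a trained Bayes classifier
# returns for a given text (log-probabilities, so higher / less
# negative is better). The numbers here are illustrative only.
scores = { "Sports" => -12.4, "Politics" => -15.1, "Tech" => -13.0 }

# Pick the top-K most relevant categories by score
k = 2
top_k = scores.max_by(k) { |_category, score| score }.map(&:first)
# => ["Sports", "Tech"]
```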

ibnesayeed avatar Jan 15 '17 03:01 ibnesayeed

Should we also consider adding ruby-fann (Fast Artificial Neural Network)? It won't be good for text data, I guess, but for numeric stuff it would be great.

ibnesayeed avatar Jan 15 '17 03:01 ibnesayeed

@ibnesayeed Yes, I am planning to get multiple scores out of Bayes, but I guess it will take up a large amount of storage space.

Looooong avatar Jan 15 '17 08:01 Looooong

@Looooong it really depends on the amount of training data. Between Bayes and LSI, the first one takes relatively less space. If you have a huge amount of data, here are a few things you can do:

  • Use the newly introduced Redis backend for storage, which would still take the required amount of memory, but it can be off-loaded to a remote machine with lots of memory. Additionally, it will persist the data to disk in case of a sudden crash.
  • Use a sample of the training data, not the whole of it. Then throw a bunch of test data at it to see how well it performs and how many false positives and false negatives you get. If the classifier gives satisfactory results, there is no need to train further; otherwise, train with more data and measure the results again. This way you can find the right balance between how much memory you can afford and the minimum accuracy you can accept as the trade-off.
  • If you really want to use all the training data but can't afford enough memory, you can implement an ORM backend and save the model in your favorite database. This would be much slower than the Memory backend, but you can train with petabytes of data. Implementing that won't be difficult, as the storage layer was abstracted recently.
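The sample-and-measure loop from the second bullet can be sketched in plain Ruby. The toy predict lambda below stands in for a trained classifier's classify call, and the test data is made up for illustration:

```ruby
# Held-out test examples with known labels (illustrative data)
test_set = [
  { text: "free money now",  label: "Spam" },
  { text: "meeting at noon", label: "Ham"  },
  { text: "cheap pills",     label: "Spam" },
  { text: "lunch tomorrow",  label: "Ham"  },
]

# Toy stand-in for a trained classifier's classify(text) call
predict = ->(text) { text.include?("free") || text.include?("cheap") ? "Spam" : "Ham" }

# Count errors for the "Spam" category and overall accuracy
false_positives = test_set.count { |ex| predict.(ex[:text]) == "Spam" && ex[:label] != "Spam" }
false_negatives = test_set.count { |ex| predict.(ex[:text]) != "Spam" && ex[:label] == "Spam" }
accuracy = test_set.count { |ex| predict.(ex[:text]) == ex[:label] }.fdiv(test_set.size)
# When these numbers are unsatisfactory, train with more data and re-measure.
```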

ibnesayeed avatar Jan 15 '17 14:01 ibnesayeed

Since Hasher is basically a bag-of-words implementation, it might make sense to rename it accordingly, make it public, and document it as such. Thoughts? @ibnesayeed?

Ch4s3 avatar Jan 17 '17 22:01 Ch4s3

Since Hasher is basically a bag-of-words implementation, it might make sense to rename it accordingly, make it public, and document it as such. Thoughts? @ibnesayeed?

I agree. However, I would note one thing here that I encountered today while writing tests for stopwords: this needs to be instantiated and passed in as a dependency during classifier initialization, so that one classifier does not step on another's state through shared data. Currently, if two classifiers are instantiated with different configurations in a single program, and one of them changes the set of stopwords, the other classifier is affected as well. To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests were failing.
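To illustrate the dependency-injection idea: each classifier would get its own hasher instance holding its own stopword set, instead of all classifiers sharing one global list. Class and method names below are hypothetical, not classifier-reborn's actual API:

```ruby
# Hypothetical sketch: per-instance stopwords via dependency injection.
# Mutating one instance's stopwords cannot affect another instance.
class WordHasher
  def initialize(stopwords: %w[the a an of])
    @stopwords = stopwords.dup  # private copy; changes stay local
  end

  def word_hash(text)
    text.downcase.scan(/[a-z']+/).reject { |w| @stopwords.include?(w) }.tally
  end
end

strict  = WordHasher.new(stopwords: %w[the a an of and or])
lenient = WordHasher.new(stopwords: [])

strict.word_hash("the cat and the hat")   # => {"cat"=>1, "hat"=>1}
lenient.word_hash("the cat and the hat")  # => {"the"=>2, "cat"=>1, "and"=>1, "hat"=>1}
```

Each classifier would then be constructed with its own hasher, so tests no longer need to save and restore a global stopword list.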

ibnesayeed avatar Jan 18 '17 01:01 ibnesayeed

To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests were failing.

Yeah I noticed that. I think DI is the way to go then.

Ch4s3 avatar Jan 18 '17 03:01 Ch4s3

Yeah I noticed that. I think DI is the way to go then.

In fact, I had many other test cases in mind around stopwords that I could not put in place because they were seemingly very difficult (if not impossible). Similarly, some test cases could have been combined as multiple assertions, but I had to separate them and duplicate most of the logic because of this state-sharing behavior.

ibnesayeed avatar Jan 18 '17 03:01 ibnesayeed