Adding text vectorization

Open ghost opened this issue 8 years ago • 5 comments

Hey guys and ladies!

I was wondering (and I'm offering to work on this a bit myself) whether you'd consider it appropriate to add some text vectorization to rusty-machine, based on sklearn's current features:

  • Simple frequency count
  • TF-IDF
  • Hashing techniques (frequency count + hashing trick)

It'd be pretty cool to add some examples of sentiment analysis or something like that using rusty-machine only :P
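To make the three techniques above concrete, here is a rough sketch using only the Rust standard library. None of these function names exist in rusty-machine; they are made up for illustration. The tokenizer is a plain lowercase whitespace split, and the TF-IDF weighting is the unsmoothed textbook form `tf * ln(N / df)`, which differs from sklearn's smoothed default.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Lowercase whitespace tokenizer (a deliberate simplification).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Map each distinct token to a column index, in order of first appearance.
fn build_vocabulary(docs: &[&str]) -> HashMap<String, usize> {
    let mut vocab = HashMap::new();
    for doc in docs {
        for token in tokenize(doc) {
            let next = vocab.len();
            vocab.entry(token).or_insert(next);
        }
    }
    vocab
}

/// Frequency count: one row per document, one column per vocabulary token.
/// Tokens missing from the vocabulary are ignored.
fn count_vectorize(docs: &[&str], vocab: &HashMap<String, usize>) -> Vec<Vec<f64>> {
    docs.iter()
        .map(|doc| {
            let mut row = vec![0.0; vocab.len()];
            for token in tokenize(doc) {
                if let Some(&col) = vocab.get(&token) {
                    row[col] += 1.0;
                }
            }
            row
        })
        .collect()
}

/// TF-IDF on top of raw counts, with idf = ln(N / df) (no smoothing).
fn tfidf(counts: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_docs = counts.len() as f64;
    let n_cols = counts.first().map_or(0, |r| r.len());
    let idf: Vec<f64> = (0..n_cols)
        .map(|c| {
            // Document frequency: number of rows where column c is non-zero.
            let df = counts.iter().filter(|row| row[c] > 0.0).count() as f64;
            if df > 0.0 { (n_docs / df).ln() } else { 0.0 }
        })
        .collect();
    counts
        .iter()
        .map(|row| row.iter().zip(&idf).map(|(tf, w)| tf * w).collect())
        .collect()
}

/// Hashing trick: no vocabulary, the column is hash(token) % n_features.
/// `n_features` must be non-zero. Distinct tokens may collide.
/// DefaultHasher's algorithm is unspecified and may change between Rust
/// releases, so a real implementation would likely pin a specific hash.
fn hash_vectorize(doc: &str, n_features: usize) -> Vec<f64> {
    let mut row = vec![0.0; n_features];
    for token in tokenize(doc) {
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        let col = (hasher.finish() % n_features as u64) as usize;
        row[col] += 1.0;
    }
    row
}
```

The dense `Vec<Vec<f64>>` rows are only for readability; with a realistic vocabulary most entries would be zero.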

ghost avatar Apr 24 '17 04:04 ghost

I'd personally love that!

tafia avatar Apr 24 '17 05:04 tafia

I was thinking of using the Transformer trait. However, it isn't appropriate because it requires the input and output to be of the same type.

ghost avatar Apr 24 '17 16:04 ghost

I agree that this is a really great idea!

It seems an unfortunate restriction that you cannot use the Transformer trait. I think that it might be worth changing the trait to allow different input and output types. Do either of you see any reason why this might cause issues? It would be a fairly minor breaking change (for users who have implemented the trait themselves).
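As a rough illustration of what separate input and output types could look like, here is a sketch with an associated output type. The `Transform` trait, `Scale`, and `TokenCount` below are hypothetical names invented for this example; this is not rusty-machine's actual Transformer definition, and the `String` error type is just a placeholder.

```rust
/// Hypothetical generalisation: the input type `I` and the output type
/// are independent of each other.
trait Transform<I> {
    type Output;
    fn transform(&mut self, inputs: I) -> Result<Self::Output, String>;
}

/// Same-type case: numbers in, numbers out, as with a scaling transformer.
struct Scale(f64);

impl Transform<Vec<f64>> for Scale {
    type Output = Vec<f64>;

    fn transform(&mut self, inputs: Vec<f64>) -> Result<Vec<f64>, String> {
        Ok(inputs.into_iter().map(|x| x * self.0).collect())
    }
}

/// Mixed-type case: text in, numbers out (here, one token count per document).
struct TokenCount;

impl<'a> Transform<&'a [&'a str]> for TokenCount {
    type Output = Vec<f64>;

    fn transform(&mut self, inputs: &'a [&'a str]) -> Result<Vec<f64>, String> {
        Ok(inputs
            .iter()
            .map(|doc| doc.split_whitespace().count() as f64)
            .collect())
    }
}
```

With this shape, existing same-type implementors would only need to add a `type Output` line, which fits the "fairly minor breaking change" described above.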

AtheMathmo avatar Apr 24 '17 17:04 AtheMathmo

I just implemented a Vectorizer trait that is pretty similar to Transformer. It could be used as a base for non-text stuff, like images or nested data, for example. Here is a little proof of concept:

https://github.com/z1mvader/rusty-machine/blob/master/src/data/vectorizers/text.rs

But if @AtheMathmo wants, we could just modify the Transformer trait instead.

ghost avatar Apr 24 '17 18:04 ghost

Besides the Transformer trait, I believe there are two main needs for the text vectorization workflow: first, being able to set your own tokenizer, and second, support for sparse matrices/vectors. I don't know if rusty-machine supports sparse matrices right now.
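A minimal sketch of both needs together, assuming nothing about rusty-machine's own types: the tokenizer is passed in as a closure, and each row is stored as a sorted list of `(column, count)` pairs standing in for a real sparse vector type. `CountVectorizer` and `SparseRow` are hypothetical names used only for this example.

```rust
use std::collections::{BTreeMap, HashMap};

/// Stand-in sparse row: only non-zero columns, sorted by column index.
type SparseRow = Vec<(usize, f64)>;

/// Count vectorizer that is generic over a user-supplied tokenizer.
struct CountVectorizer<F: Fn(&str) -> Vec<String>> {
    tokenizer: F,
    vocab: HashMap<String, usize>,
}

impl<F: Fn(&str) -> Vec<String>> CountVectorizer<F> {
    fn new(tokenizer: F) -> Self {
        CountVectorizer { tokenizer, vocab: HashMap::new() }
    }

    /// Learn the vocabulary: each new token gets the next free column.
    fn fit(&mut self, docs: &[&str]) {
        for &doc in docs {
            for token in (self.tokenizer)(doc) {
                let next = self.vocab.len();
                self.vocab.entry(token).or_insert(next);
            }
        }
    }

    /// Count known tokens in one document; unknown tokens are skipped.
    fn transform(&self, doc: &str) -> SparseRow {
        let mut counts: BTreeMap<usize, f64> = BTreeMap::new();
        for token in (self.tokenizer)(doc) {
            if let Some(&col) = self.vocab.get(&token) {
                *counts.entry(col).or_insert(0.0) += 1.0;
            }
        }
        // BTreeMap iterates in key order, so the row comes out sorted.
        counts.into_iter().collect()
    }
}
```

A comma-splitting tokenizer, for instance, could be plugged in as `CountVectorizer::new(|s: &str| -> Vec<String> { s.split(',').map(|t| t.trim().to_lowercase()).collect() })`.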

ghost avatar Apr 24 '17 20:04 ghost