rusty-machine
Adding text vectorization
Hey guys and ladies!
I was wondering (and I'm offering to work on this a little myself) whether you'd consider it appropriate to add some text vectorization to rusty-machine, based on sklearn's current features:
- Simple frequency count
- TF-IDF
- Hashing techniques (frequency count + hashing trick)
It'd be pretty cool to add some examples of sentiment analysis or something like that using rusty-machine only :P
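To make the three options concrete, here is a rough sketch using only the standard library. Nothing here is rusty-machine API: `tokenize`, `fit_vocabulary`, `count_vectorize`, `idf_weights`, `tfidf_vectorize` and `hash_vectorize` are hypothetical names, the tokenizer is a naive whitespace split, and the IDF formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1` without any row normalization.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Naive tokenizer (assumption): lowercase and split on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Map every distinct token to a column index, in order of first appearance.
fn fit_vocabulary(docs: &[&str]) -> HashMap<String, usize> {
    let mut vocab = HashMap::new();
    for doc in docs {
        for token in tokenize(doc) {
            let next = vocab.len();
            vocab.entry(token).or_insert(next);
        }
    }
    vocab
}

/// Simple frequency count: one dense row per document, unknown tokens are skipped.
fn count_vectorize(doc: &str, vocab: &HashMap<String, usize>) -> Vec<f64> {
    let mut row = vec![0.0_f64; vocab.len()];
    for token in tokenize(doc) {
        if let Some(&col) = vocab.get(&token) {
            row[col] += 1.0;
        }
    }
    row
}

/// Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
fn idf_weights(docs: &[&str], vocab: &HashMap<String, usize>) -> Vec<f64> {
    let mut df = vec![0.0_f64; vocab.len()];
    for doc in docs {
        // Count each term at most once per document.
        let mut seen = HashSet::new();
        for token in tokenize(doc) {
            if let Some(&col) = vocab.get(&token) {
                if seen.insert(col) {
                    df[col] += 1.0;
                }
            }
        }
    }
    let n = docs.len() as f64;
    df.iter().map(|d| ((1.0 + n) / (1.0 + *d)).ln() + 1.0).collect()
}

/// TF-IDF: raw counts scaled column-wise by the IDF weights (no normalization).
fn tfidf_vectorize(doc: &str, vocab: &HashMap<String, usize>, idf: &[f64]) -> Vec<f64> {
    count_vectorize(doc, vocab)
        .iter()
        .zip(idf)
        .map(|(tf, w)| tf * w)
        .collect()
}

/// Hashing trick: no vocabulary, the column is hash(token) % n_features.
/// `n_features` must be greater than zero; collisions are simply summed.
fn hash_vectorize(doc: &str, n_features: usize) -> Vec<f64> {
    let mut row = vec![0.0_f64; n_features];
    for token in tokenize(doc) {
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        let col = (hasher.finish() % n_features as u64) as usize;
        row[col] += 1.0;
    }
    row
}
```

The hashing version needs no `fit` step, which is its main appeal, but `DefaultHasher`'s algorithm is not guaranteed to stay the same across Rust releases, so a real implementation would probably want a fixed hash function.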
I'd personally love that!
I was thinking of using the Transformer trait. However, it's not appropriate because it requires the input and output to be of the same type.
I agree that this is a really great idea!
It seems an unfortunate restriction that you cannot use the Transformer trait. I think that it might be worth changing the trait to allow different input and output types. Do either of you see any reason why this might cause issues? It would be a fairly minor breaking change (for users who have implemented the trait themselves).
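For discussion, here is a rough sketch of what a trait with separate input and output types could look like. This is not the current rusty-machine trait: `Transform`, `TokenCounter` and `Doubler` are hypothetical names, and the `String` error type is only a placeholder for whatever error type the library would use.

```rust
/// Hypothetical trait: the input and output types are independent parameters.
trait Transform<In, Out> {
    fn transform(&mut self, inputs: In) -> Result<Out, String>;
}

/// Text in, numbers out: the case the same-type restriction rules out.
struct TokenCounter;

impl<'a> Transform<&'a [&'a str], Vec<f64>> for TokenCounter {
    fn transform(&mut self, inputs: &'a [&'a str]) -> Result<Vec<f64>, String> {
        // One feature per document: the number of whitespace-separated tokens.
        Ok(inputs
            .iter()
            .map(|doc| doc.split_whitespace().count() as f64)
            .collect())
    }
}

/// Same type in and out: existing-style transformers still fit the trait.
struct Doubler;

impl Transform<Vec<f64>, Vec<f64>> for Doubler {
    fn transform(&mut self, inputs: Vec<f64>) -> Result<Vec<f64>, String> {
        Ok(inputs.into_iter().map(|x| x * 2.0).collect())
    }
}
```

With this shape, existing implementors would only need to name the same type twice, which is where the minor breaking change would come from.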
I've just implemented a Vectorizer trait that is pretty similar to Transformer. It could be used as a base for non-text stuff, like images or nested data for example. Here is a little proof of concept:
https://github.com/z1mvader/rusty-machine/blob/master/src/data/vectorizers/text.rs
But if @AtheMathmo prefers, we could just modify the Transformer trait instead.
Besides the Transformer trait, I believe there are two main needs for the text vectorization workflow: first, being able to set your own tokenizer, and second, support for sparse matrices/vectors. I don't know if rusty-machine supports sparse matrices right now.
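As a rough illustration of both needs, here is a sketch of a count vectorizer that takes the tokenizer as a closure and returns each row in a sparse form. It uses only the standard library; `CountVectorizer` and its methods are hypothetical names, and the sparse row is just a sorted list of `(column, count)` pairs rather than a real sparse matrix type.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical vectorizer that is generic over a user-supplied tokenizer.
struct CountVectorizer<F: Fn(&str) -> Vec<String>> {
    tokenizer: F,
    vocab: HashMap<String, usize>,
}

impl<F: Fn(&str) -> Vec<String>> CountVectorizer<F> {
    fn new(tokenizer: F) -> Self {
        CountVectorizer { tokenizer, vocab: HashMap::new() }
    }

    /// Learn the vocabulary: each new token gets the next free column index.
    fn fit(&mut self, docs: &[&str]) {
        for doc in docs {
            for token in (self.tokenizer)(*doc) {
                let next = self.vocab.len();
                self.vocab.entry(token).or_insert(next);
            }
        }
    }

    /// Sparse row: only non-zero entries, as (column, count) sorted by column.
    /// Tokens that were not seen during `fit` are ignored.
    fn transform(&self, doc: &str) -> Vec<(usize, f64)> {
        let mut counts: BTreeMap<usize, f64> = BTreeMap::new();
        for token in (self.tokenizer)(doc) {
            if let Some(&col) = self.vocab.get(&token) {
                *counts.entry(col).or_insert(0.0) += 1.0;
            }
        }
        counts.into_iter().collect()
    }
}
```

For example, passing `|s: &str| s.split(',').map(|t| t.trim().to_string()).collect::<Vec<String>>()` as the tokenizer would split on commas instead of whitespace. Since most documents use only a tiny fraction of the vocabulary, storing only the non-zero entries keeps memory proportional to document length rather than vocabulary size.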