JuliaDB.jl

Add ML.Text to ML Schema

Open Akaban opened this issue 7 years ago • 0 comments

Hello,

I find the ML Schema feature very interesting, but wouldn't it be great to have ML.Text alongside ML.Continuous and ML.Categorical to identify text features? This would let the user process text directly with JuliaDB (using one-hot encoding, for example, or projection embeddings like Word2Vec, FastText, GloVe, ...).

ML.Text could be detected as an ML.Categorical feature with very high cardinality. For example, say we set our threshold to 0.2 (it would be between 0 and 1). If the ML.Categorical feature has more than 0.2 * (n_rows_of_dataset) unique values, then it is in fact an ML.Text feature; otherwise it's ML.Categorical.

Or we just let the user decide using hints (a feature detected as nothing could be set as ML.Text by the user).
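
To make the two options above concrete, here is a minimal sketch in plain Julia. It does not use JuliaDB's actual ML API: the names `guess_kind` and `guess_schema` are hypothetical, and the column kinds are plain symbols (`:Text`, `:Categorical`) standing in for schema types. It only illustrates the cardinality threshold and how user hints could override the guess.

```julia
# Hypothetical sketch: none of these names exist in JuliaDB.
# Column kinds are plain symbols standing in for schema types.

# Guess :Text when the number of distinct values exceeds
# threshold * n_rows, otherwise :Categorical.
function guess_kind(col::AbstractVector; threshold::Real = 0.2)
    0 <= threshold <= 1 || throw(ArgumentError("threshold must be in [0, 1]"))
    isempty(col) && return :Categorical
    return length(unique(col)) > threshold * length(col) ? :Text : :Categorical
end

# Guess a kind for every column, then let user-supplied hints win.
function guess_schema(columns::AbstractDict;
                      hints::AbstractDict = Dict{Symbol,Symbol}(),
                      threshold::Real = 0.2)
    guessed = Dict{Symbol,Symbol}(
        name => guess_kind(col; threshold = threshold) for (name, col) in columns
    )
    return merge(guessed, hints)  # entries in `hints` replace the guesses
end

# Toy data: 20 rows, so the cutoff is 0.2 * 20 = 4 distinct values.
columns = Dict(
    :review  => ["review number $i" for i in 1:20],  # 20 distinct values
    :color   => repeat(["red", "blue"], 10),         # 2 distinct values
    :user_id => string.(1:20),                       # 20 distinct values
)

auto   = guess_schema(columns)
hinted = guess_schema(columns; hints = Dict(:user_id => :Categorical))
```

With this toy data, `auto[:review]` and `auto[:user_id]` come out as `:Text` and `auto[:color]` as `:Categorical`, while `hinted[:user_id]` is `:Categorical` because the hint takes precedence. The `:user_id` case also shows why hints would still be needed: a pure cardinality test cannot tell free text from an identifier column.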

Another interesting schema type would be ML.UniqueID, which could identify an entity or user; it could then be processed using something like User2Vec (which learns an embedding of a user from the other features). That said, the ML.Text feature would be much more important.

Any thoughts about this? :)

Cheers,

Bryce Tichit

Akaban · Aug 18 '18 17:08