Can Glint support discrete feature IDs?

Open cstur4 opened this issue 8 years ago • 6 comments

I have millions of discrete 64-byte features, and remapping them to contiguous ids is expensive. Can Glint support such features directly?

cstur4 avatar Jul 26 '16 05:07 cstur4

Although remapping is expensive, it's currently the only feasible thing to do with Glint since all matrices and vectors are stored as dense arrays.

One of my goals right now is to write a roadmap that outlines several things we want to implement in the near future. It includes sparse matrices and vectors. With sparsity it will be possible to create very large feature spaces with little/no memory overhead. However, because everything is indexed with Long values, it would still be limited to a size of 2^63.
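
For readers unfamiliar with the remapping step discussed above, here is a minimal Python sketch of the idea. It is not Glint code: `build_index` and the sample ids are hypothetical, chosen only to show why a dense array needs contiguous indices and why the extra dictionary pass becomes costly at scale.

```python
def build_index(raw_ids):
    """Assign each distinct raw feature id a contiguous index, in first-seen order."""
    index = {}
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index)
    return index


# Hypothetical raw ids: large, scattered values that cannot index a dense array directly.
raw_ids = [9007199254740993, 42, 9007199254740993, 1 << 62]

index = build_index(raw_ids)

# A dense vector only needs one slot per distinct feature after remapping.
dense = [0.0] * len(index)
for raw in raw_ids:
    dense[index[raw]] += 1.0
```

The dictionary holds one entry per distinct feature, so with millions of features it must either fit on one machine or be built in a distributed pass, which is where the expense comes from.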

rjagerman avatar Jul 26 '16 09:07 rjagerman

How long will it take? I am eager to integrate Glint with YARN using Long-keyed features. I have a preliminary version, but it takes a long time to run on larger data.

cstur4 avatar Jul 26 '16 09:07 cstur4

I can't give an accurate time estimate at the moment. In terms of importance, I prioritize implementing fault tolerance over other features right now, so it could be on the order of months before I get to sparsity.

And even then, I'm not sure how well this will work... Sparse data structures typically come with considerable memory overhead (one or more objects per key/value pair), which the JVM garbage collector unfortunately does not like. I'm considering using something like debox or scala-offheap to bypass this garbage collection problem, but both are rather experimental.

rjagerman avatar Jul 27 '16 08:07 rjagerman

I used debox in my algorithm, and it helps a lot. For now I remap ids to contiguous ones, and I look forward to your help with integrating Glint with Spark. Thanks a lot.

cstur4 avatar Jul 27 '16 09:07 cstur4

I implemented a key-value based partitioner to avoid remapping feature ids. A hash-based version may scale better to large data. I am glad to send a pull request if there is a need for one.
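
As an illustration only, a hash-based key-value partitioner could look roughly like the Python sketch below. The partitioner mentioned above is not shown in this thread, so this is not that code and not Glint's API: `mix64`, `server_for`, and `HashPartitionedVector` are hypothetical names, and the in-process dictionaries stand in for parameter servers. The point is that a raw 64-bit key is routed by hash, so no global id remapping is needed.

```python
MASK64 = (1 << 64) - 1


def mix64(key):
    """Scramble a 64-bit key: multiply by an odd constant, then fold the high bits down."""
    x = ((key & MASK64) * 0x9E3779B97F4A7C15) & MASK64
    return x ^ (x >> 32)


def server_for(key, num_servers):
    """Pick the shard that owns this key; the same key always maps to the same shard."""
    return mix64(key) % num_servers


class HashPartitionedVector:
    """Sparse vector split across shards; each shard stores only the keys it has seen."""

    def __init__(self, num_servers):
        self.shards = [dict() for _ in range(num_servers)]

    def push(self, key, delta):
        # Add delta to the value stored under key on its owning shard.
        shard = self.shards[server_for(key, len(self.shards))]
        shard[key] = shard.get(key, 0.0) + delta

    def pull(self, key):
        # Keys that were never pushed read as zero.
        return self.shards[server_for(key, len(self.shards))].get(key, 0.0)


vec = HashPartitionedVector(num_servers=4)
for k in [1 << 62, 42, -7, 9007199254740993]:
    vec.push(k, 1.0)
vec.push(42, 0.5)
```

The trade-off is that every stored key costs a map entry, which is the per-key memory overhead raised earlier in the thread.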

cstur4 avatar Aug 08 '16 02:08 cstur4

@cstur4 I'd like to see that PR. For the sparse models I'm looking at, it would be very important for scalability.

MLnick avatar Aug 08 '16 07:08 MLnick