
Output layer question

Open bkj opened this issue 7 years ago • 6 comments

The output layer in these networks is often a bottleneck, because you have to do a (batch_size, hidden_dim) by (hidden_dim, num_classes) dense matrix multiplication. It doesn't seem like you'd get a speedup just by avoiding storing/multiplying by zeros -- are you doing any tricks here to reduce the cost of that operation?
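For concreteness, here's a minimal NumPy sketch of the multiply in question (shapes are made up for illustration; this is not DSSTNE code):

```python
import numpy as np

batch_size, hidden_dim, num_classes = 256, 256, 100_000

H = np.random.randn(batch_size, hidden_dim).astype(np.float32)   # hidden activations
W = np.random.randn(hidden_dim, num_classes).astype(np.float32)  # output weights

# Dense output layer: batch_size * hidden_dim * num_classes multiply-adds,
# and every column of W gets read, no matter how sparse the targets are.
logits = H @ W   # shape (batch_size, num_classes)
```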

Thanks ~ Ben

bkj avatar Jun 08 '18 00:06 bkj

We have some ideas here based on approximate kNN methods. Stay tuned.
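To sketch the general shape of that idea (a toy illustration, not DSSTNE's actual implementation): at inference time the output layer can be treated as a nearest-neighbour search over the class weight vectors, with brute force below standing in for an approximate kNN index.

```python
import numpy as np

hidden_dim, num_classes, k = 256, 100_000, 10
W = np.random.randn(num_classes, hidden_dim).astype(np.float32)  # one weight row per class
h = np.random.randn(hidden_dim).astype(np.float32)               # hidden activation for one example

# Exact inner-product scores; an approximate kNN index built over the rows of W
# would replace this full multiply and return (roughly) the same top-k ids.
scores = W @ h
topk = np.argpartition(-scores, k)[:k]
topk = topk[np.argsort(-scores[topk])]   # top-k class ids, highest score first
```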

scottlegrand avatar Jun 12 '18 18:06 scottlegrand

Interesting -- are you thinking just for inference or for both inference and training?

bkj avatar Jun 12 '18 19:06 bkj

It's a no-brainer for inference; it's a science project for training.

scottlegrand avatar Jun 12 '18 19:06 scottlegrand

Relevant code and paper, if you haven't seen it: https://github.com/rdspring1/LSH_DeepLearning (I don't think they did it on GPUs or big models, but it might be interesting).
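For anyone landing here later, the core trick in that paper is to hash the output weight rows with LSH and only treat the neurons that fall in the query's bucket as active. A toy SimHash sketch of that idea (illustrative only, not their implementation, which also uses several hash tables):

```python
import numpy as np
from collections import defaultdict

hidden_dim, num_classes, n_bits = 256, 100_000, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((num_classes, hidden_dim)).astype(np.float32)   # output weight rows
planes = rng.standard_normal((n_bits, hidden_dim)).astype(np.float32)   # random hyperplanes

# Hash every output neuron's weight vector to a SimHash bucket
# (re-done periodically during training as the weights move).
codes = (W @ planes.T > 0).astype(np.int64)          # (num_classes, n_bits) sign bits
bit_values = 1 << np.arange(n_bits)
bucket_of = codes @ bit_values                       # integer bucket id per neuron

buckets = defaultdict(list)
for neuron, b in enumerate(bucket_of):
    buckets[int(b)].append(neuron)

# At each step, only neurons whose weights hash like the activation are "active".
h = rng.standard_normal(hidden_dim).astype(np.float32)
q = int(((planes @ h > 0).astype(np.int64) * bit_values).sum())
active = buckets.get(q, [])                          # candidate output neurons
```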

bkj avatar Jun 12 '18 19:06 bkj

Yep, I came up with roughly the same idea last summer, and then I read that paper, which fleshed it out even further, last fall. My sparse input kernels end up ~20x faster than SGEMM, so their seeing more or less the same speedup with LSH makes intuitive sense: one ends up memory-limited. Older GPUs would be a bear for this, but newer GPUs support arbitrary CUDA streams and have much larger L2 caches, so it shouldn't be all that hard to write.
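To make the compute/memory-traffic argument concrete, here's a rough sketch of restricting the output multiply to a candidate set of columns (made-up shapes and a random candidate set; not DSSTNE's kernels):

```python
import numpy as np

batch_size, hidden_dim, num_classes = 256, 256, 100_000
H = np.random.randn(batch_size, hidden_dim).astype(np.float32)
W = np.random.randn(hidden_dim, num_classes).astype(np.float32)

# Dense: reads all num_classes columns of W and does the full multiply.
dense_logits = H @ W

# Candidate-restricted: only the columns picked (e.g. by LSH / approximate kNN)
# get gathered and multiplied -- here ~100x fewer FLOPs and weight bytes.
candidates = np.random.choice(num_classes, size=1_000, replace=False)
sparse_logits = H @ W[:, candidates]
```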

scottlegrand avatar Jun 12 '18 19:06 scottlegrand

Any updates on this? I'm trying to find an example of a library that uses approximate kNN methods to speed up the output layer.

bkj avatar Nov 21 '18 20:11 bkj