
The sample doesn't seem to trigger the sparse path

Open skyw opened this issue 8 years ago • 4 comments

It looks like the default ml20-all sample doesn't call the sparse kernels. All I see is cublasSgemm. The log says: "NNDataSet::CalculateSparseDatapointCounts: Maximum sparse datapoints (9254) per example in dataset gl_input too large for fast sparse kernels."

Does that mean that, given the sparsity of this case, dense SGEMM still outperforms the sparse kernels?
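
(For context: the gate that log message comes from presumably looks something like the C++ sketch below. The threshold constant, function signature, and variable names are assumptions for illustration, not DSSTNE's actual code.)

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical per-example limit for the fast kernels; the real threshold
// lives inside DSSTNE's kernel configuration.
constexpr uint64_t kMaxSparseDatapoints = 8192;

// Decide whether every example is sparse enough for the fast path.
// sparseStart/sparseEnd delimit each example's nonzeros, CSR-style.
bool UseFastSparseKernels(const std::vector<uint64_t>& sparseStart,
                          const std::vector<uint64_t>& sparseEnd)
{
    uint64_t maxCount = 0;
    for (size_t i = 0; i < sparseStart.size(); ++i)
        maxCount = std::max(maxCount, sparseEnd[i] - sparseStart[i]);

    if (maxCount > kMaxSparseDatapoints)
    {
        std::printf("Maximum sparse datapoints (%llu) per example too large "
                    "for fast sparse kernels.\n",
                    static_cast<unsigned long long>(maxCount));
        return false;   // caller falls back to dense cublasSgemm
    }
    return true;
}
```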

skyw avatar Jun 08 '16 23:06 skyw

TL;DR: Correct, it doesn't call the fast path.

Longer answer: if the maximum datapoint count were within the specified limit, one could roughly double performance here. I'll try to hack together a demo that ignores the 9255th movie and beyond to illustrate this. The performance gains here come entirely from more efficient management of sparse data.
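
(A minimal C++ sketch of that kind of preprocessing, assuming the data is held as one list of item indices per user; CapExamples and userMovieLists are hypothetical names, not part of DSSTNE:)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keep at most `cap` item indices per example so every example fits the
// fast-kernel limit; indices past the cap are simply dropped.
std::vector<std::vector<uint32_t>> CapExamples(
    const std::vector<std::vector<uint32_t>>& examples, std::size_t cap)
{
    std::vector<std::vector<uint32_t>> out;
    out.reserve(examples.size());
    for (const auto& e : examples)
    {
        const std::size_t n = std::min(e.size(), cap);
        out.emplace_back(e.begin(), e.begin() + n);
    }
    return out;
}

// Usage: CapExamples(userMovieLists, 9254) keeps the first 9254 movies per
// user, dropping the 9255th and beyond, before the dataset is written out.
```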


scottlegrand avatar Jun 09 '16 01:06 scottlegrand

Thanks for the fast reply. So the "fast path" is not faster in this case? Is there another sample that calls the fast path?

skyw avatar Jun 09 '16 01:06 skyw

Even without the sparse kernels, DSSTNE's management of sparse data beats TensorFlow, because I believe the latter relies on cuSPARSE and performs a significant amount of data uploading. Evidence: on a GRID K520 GPU, DSSTNE is ~2.5x faster than TensorFlow, but on a GTX Titan X it is nearly 6x faster.

This means TensorFlow is bottlenecked on something that didn't improve between those two GPUs, most likely the data uploads mentioned above.

Interestingly, the data sets for which DSSTNE was developed were 0.1% dense or less. MovieLens 20M is approximately 0.4% dense. At some higher density, cuSPARSE should ultimately beat DSSTNE's kernels, but I'm not yet sure where the crossover point lies.
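
(For concreteness, the density in question is just nonzeros divided by the full matrix size. A minimal C++ sketch with approximate MovieLens 20M counts; the exact percentage depends on how users and items are indexed:)

```cpp
#include <cstdio>

int main()
{
    // Approximate MovieLens 20M shape: ~138k users x ~27k movies,
    // ~20M ratings (nonzeros).
    const double users   = 138493;
    const double movies  = 27278;
    const double ratings = 20000263;

    const double density = ratings / (users * movies);
    std::printf("density = %.2f%%\n", density * 100.0);
    // Prints ~0.53%, the same order as the ~0.4% quoted above.
    return 0;
}
```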

scottlegrand avatar Jul 18 '16 02:07 scottlegrand

Is there anything more to do with this issue? If not, can it be closed?

ekandrotA9 avatar Nov 07 '17 01:11 ekandrotA9