amazon-dsstne
The sample doesn't seem to trigger the sparse path
It looks like the default ml20-all sample doesn't call the sparse kernels. All I see is cublasSgemm. The log says "NNDataSet::CalculateSparseDatapointCounts: Maximum sparse datapoints (9254) per example in dataset gl_input too large for fast sparse kernels."
Does that mean that, given the sparsity of this case, dense sgemm still outperforms the sparse kernels?
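The check the log message refers to can be reproduced ahead of time by counting nonzero entries per example. A minimal sketch (the function name and data layout are illustrative, not DSSTNE's actual API):

```python
# Count the maximum number of sparse datapoints (nonzero entries)
# per example, analogous to what the log line from
# NNDataSet::CalculateSparseDatapointCounts reports.
def max_sparse_datapoints(examples):
    """examples: list of per-example lists of nonzero feature indices."""
    return max(len(indices) for indices in examples)

# Toy dataset: three examples with different numbers of nonzeros.
examples = [[0, 5, 9], [2], [1, 3, 4, 7]]
print(max_sparse_datapoints(examples))  # 4
```

If this maximum exceeds the fast-kernel limit, the engine falls back to dense cublasSgemm for that layer.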
TLDR: Correct, it doesn't call the fast path.
Longer answer: if the maximum sparse datapoint count were within the specified limit, one could roughly double performance here. I'll try to hack together a demo that ignores the 9255th movie and beyond to illustrate this. The performance gains here come entirely from more efficient management of sparse data.
Thanks for the fast reply. So the "fast path" is not faster in this case? Is there another sample that calls the fast path?
Even without the sparse kernels, DSSTNE's management of sparse data beats TensorFlow, because I believe the latter relies on cuSPARSE and performs a significant amount of data uploading. Evidence: on a Grid K520 GPU, DSSTNE is ~2.5x faster than TensorFlow, but on a GTX TitanX it is nearly 6x faster.
This means that TensorFlow is bottlenecking on something that didn't improve between those two GPUs.
Interestingly, the data sets for which DSSTNE was developed were 0.1% dense or less; MovieLens 20M is approximately 0.4% dense. At some higher density, cuSPARSE should ultimately beat DSSTNE's kernels, but I'm not yet sure where the crossover point lies.
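Density here just means the fraction of nonzero entries, nnz / (examples × features). A toy check (the helper and numbers are illustrative, chosen to land on the 0.4% figure mentioned above):

```python
def density(examples, num_features):
    """Fraction of nonzero entries in a sparse dataset given as
    per-example lists of nonzero feature indices."""
    nnz = sum(len(indices) for indices in examples)
    return nnz / (len(examples) * num_features)

# 3 examples, 1000 features, 12 nonzeros total -> 0.4% dense.
examples = [[1, 2, 3, 4], [10, 20, 30, 40], [5, 6, 7, 8]]
print(f"{density(examples, 1000):.1%}")  # 0.4%
```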
Is there anything more to do with this issue? If not, can it be closed?