simit
Inefficient Matrix Multiplication on GPU
The matrix multiplication code we emit is not tailored to GPU execution: it does not exploit any parallelism.
The sparse tensor compilation theory work should fix this, so we should consider deferring the fix until then. If someone really needs it sooner, we can add a custom solution.
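As a rough illustration of the parallelism a custom solution could exploit, here is a minimal C++ sketch. It assumes dense row-major matrices, which is a simplification and does not reflect Simit's actual matrix storage or the code it emits. The names `matmulElement` and `matmul` are hypothetical and not part of Simit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Simit: computes one element of C = A * B
// for dense row-major matrices, where A is n x k and B is k x m.
// Each call only reads A and B and returns a single value, so calls for
// different (row, col) pairs are independent of each other.
double matmulElement(const std::vector<double>& A,
                     const std::vector<double>& B,
                     std::size_t k, std::size_t m,
                     std::size_t row, std::size_t col) {
  double sum = 0.0;
  for (std::size_t i = 0; i < k; ++i) {
    sum += A[row * k + i] * B[i * m + col];
  }
  return sum;
}

// Serial driver. On a GPU, the two loops below would be replaced by a grid
// of threads, each evaluating matmulElement for its own (row, col).
std::vector<double> matmul(const std::vector<double>& A,
                           const std::vector<double>& B,
                           std::size_t n, std::size_t k, std::size_t m) {
  std::vector<double> C(n * m, 0.0);
  for (std::size_t row = 0; row < n; ++row) {
    for (std::size_t col = 0; col < m; ++col) {
      C[row * m + col] = matmulElement(A, B, k, m, row, col);
    }
  }
  return C;
}
```

Because every output element is computed independently, the work maps naturally onto one GPU thread per element. A sparse version would need a different decomposition, which is part of why waiting for the sparse tensor compilation work may be the better option.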