
Question about the latency speedup!

Open ybai62868 opened this issue 2 years ago • 4 comments

Hi,

Thanks for the great work! I am curious about whether you will provide the script to get the end-to-end inference latency on a single GPU for the Llama family models?

Thanks, Yang

ybai62868 avatar Oct 26 '23 08:10 ybai62868

I am adding this to my TODO list for this repo. Not sure when I can get back to it. In the meantime, feel free to check out this blog post on end-to-end speedup evaluation of Hugging Face Transformer models with structured sparsity.

Eric-mingjie avatar Oct 27 '23 02:10 Eric-mingjie

Hi @Eric-mingjie. I tried to benchmark the efficiency gain from sparsity. However, I found that sparse matmul seems to be slower than dense matmul.

import time

import torch

sparsity_ratio = 0.5

linear = torch.nn.Linear(1024, 3072, bias=False).float().cpu().eval()

# select the smallest-magnitude 50% of weights in each row for pruning
sort_res = torch.sort(torch.abs(linear.weight), dim=-1, stable=True)
indices = sort_res[1][:, :int(linear.weight.shape[1] * sparsity_ratio)]
mask = (torch.zeros_like(linear.weight) == 1)
mask.scatter_(1, indices, True)
# zero out the pruned weights in place; without this step the CSR tensor
# below would still store every entry as a nonzero
linear.weight.data[mask] = 0.0

x = torch.rand(3072, 1024).float().cpu()

with torch.inference_mode():
    start = time.time()
    dense_output = linear(x)
    print(f"Dense linear {(time.time() - start) * 1000} ms")

    # convert the (now 50% zero) weight to sparse CSR format
    linear.weight = torch.nn.Parameter(linear.weight.to_sparse_csr())

    start = time.time()
    sparse_output = torch.sparse.mm(linear.weight, x.t()).t()
    # sparse_output = linear(x)
    print(f"Sparse linear {(time.time() - start) * 1000} ms")

    # sparse and dense matmul should be numerically equivalent
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)

Running the above code yields the following output:

Dense linear 13.79251480102539 ms
Sparse linear 155.81130981445312 ms

The sparse matmul is about 10x slower. Do you have any idea why this happens?
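As an aside, a single time.time() measurement is quite noisy; torch.utils.benchmark.Timer handles warm-up and repeated runs. A minimal sketch, assuming the linear (already converted to CSR) and x tensors from the snippet above:

from torch.utils.benchmark import Timer

# dense copy of the weight for comparison (linear.weight is CSR at this point)
dense_w = linear.weight.to_dense()

dense_t = Timer(
    stmt="x @ dense_w.t()",
    globals={"x": x, "dense_w": dense_w},
).blocked_autorange()

sparse_t = Timer(
    stmt="torch.sparse.mm(w, x.t()).t()",
    globals={"torch": torch, "w": linear.weight, "x": x},
).blocked_autorange()

print(dense_t)
print(sparse_t)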

llCurious avatar Nov 08 '23 08:11 llCurious

Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer
SparseSemiStructuredTensor._FORCE_CUTLASS = True
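For reference, a minimal sketch of how that tutorial applies 2:4 semi-structured sparsity to a linear layer (shapes and mask construction follow the tutorial; a CUDA GPU with 2:4 sparse kernel support and fp16 weights are assumed):

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
SparseSemiStructuredTensor._FORCE_CUTLASS = True

linear = torch.nn.Linear(10240, 3072, bias=False).half().cuda().eval()
# 2:4 pattern: keep 2 out of every 4 weights along each row
mask = torch.Tensor([0, 0, 1, 1]).tile((3072, 2560)).cuda().bool()
linear.weight = torch.nn.Parameter(mask * linear.weight)
x = torch.rand(3072, 10240).half().cuda()

with torch.inference_mode():
    dense_output = linear(x)
    # swap the dense weight for its semi-structured sparse representation
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))
    sparse_output = linear(x)
    # outputs should match up to fp16 tolerance
    assert torch.allclose(dense_output, sparse_output, atol=1e-2)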

Eric-mingjie avatar Nov 09 '23 21:11 Eric-mingjie

> Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?
>
> import torch
> from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
> from torch.utils.benchmark import Timer
> SparseSemiStructuredTensor._FORCE_CUTLASS = True

Nope. I intend to use sparse matmul on the CPU; to_sparse_semi_structured seems to be designed for GPUs. By the way, after some further analysis (link), my initial conclusion is that dense matmul is already heavily optimized, so sparse matmul only becomes advantageous when the sparsity ratio is large enough (e.g., above 90%).

In this regard, unstructured weight sparsity mainly reduces memory usage and does not by itself improve throughput.
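To illustrate, here is a small sketch (not from the benchmark above) that sweeps the sparsity ratio and compares dense vs. CSR matmul on CPU; where the crossover happens will depend on the hardware and PyTorch build:

import time
import torch

out_f, in_f, batch = 3072, 1024, 3072
x = torch.rand(batch, in_f)

for ratio in (0.5, 0.9, 0.95, 0.99):
    w = torch.rand(out_f, in_f)
    w[torch.rand_like(w) < ratio] = 0.0  # zero out roughly `ratio` of the entries
    w_csr = w.to_sparse_csr()

    start = time.time()
    dense_out = x @ w.t()
    dense_ms = (time.time() - start) * 1000

    start = time.time()
    sparse_out = torch.sparse.mm(w_csr, x.t()).t()
    sparse_ms = (time.time() - start) * 1000

    print(f"sparsity={ratio:.2f}  dense={dense_ms:.2f} ms  sparse={sparse_ms:.2f} ms")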

llCurious avatar Nov 10 '23 02:11 llCurious