Question about the latency speedup!
Hi,
Thanks for the great work! I am curious whether you plan to provide a script to measure end-to-end inference latency on a single GPU for the Llama-family models?
Thanks, Yang
I am adding this to my TODO list for this repo. Not sure when I can get back to this, but feel free to check out this blog post on end-to-end speedup evaluation of Hugging Face Transformer models with structured sparsity.
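In the meantime, here is a rough sketch of how one could time end-to-end generation for a Llama-family checkpoint on a single GPU (the model id and prompt below are placeholders, not something this repo provides):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint; swap in your own (pruned) model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

with torch.inference_mode():
    # warm-up so the timed run does not include CUDA kernel compilation / caching
    model.generate(**inputs, max_new_tokens=32)

    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize()
    print(f"End-to-end generation latency: {(time.time() - start) * 1000:.1f} ms")

For stable numbers you would average over several prompts and generation lengths, but this gives a first estimate.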
Hi @Eric-mingjie. I tried to benchmark the efficiency gain from the sparsity. However, I found that sparse matmul seems to be slower than dense matmul:
import time

import torch

sparsity_ratio = 0.5
linear = torch.nn.Linear(1024, 3072, bias=False).float().cpu().eval()

# build a mask over the smallest-magnitude 50% of weights in each row and zero them out
sort_res = torch.sort(linear.weight.abs(), dim=-1, stable=True)
indices = sort_res[1][:, :int(linear.weight.shape[1] * sparsity_ratio)]
mask = torch.zeros_like(linear.weight, dtype=torch.bool)
mask.scatter_(1, indices, True)
linear.weight.data[mask] = 0.0

x = torch.rand(3072, 1024).float().cpu()

with torch.inference_mode():
    start = time.time()
    dense_output = linear(x)
    print(f"Dense linear {(time.time() - start) * 1000} ms")

    # convert the pruned weight to sparse CSR format
    weight_csr = linear.weight.to_sparse_csr()

    start = time.time()
    sparse_output = torch.sparse.mm(weight_csr, x.t()).t()
    print(f"Sparse linear {(time.time() - start) * 1000} ms")

    # sparse and dense matmul are numerically equivalent
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)
Running the above code yields the following output:
Dense linear 13.79251480102539 ms
Sparse linear 155.81130981445312 ms
The sparse matmul is about 10x slower. Do you have any idea why this happens?
Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer
SparseSemiStructuredTensor._FORCE_CUTLASS = True
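For reference, here is a sketch along the lines of that tutorial for timing a 2:4 (semi-structured) sparse linear layer; the shapes are illustrative, and it requires a CUDA GPU (Ampere or newer) with fp16 weights, so it does not apply to the CPU setting:

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer

SparseSemiStructuredTensor._FORCE_CUTLASS = True

# make the weight 2:4 sparse (two zeros in every group of four along the input dim)
mask = torch.tensor([0, 0, 1, 1], dtype=torch.bool).tile((3072, 256)).cuda()
linear = torch.nn.Linear(1024, 3072, bias=False).half().cuda().eval()
linear.weight = torch.nn.Parameter(mask * linear.weight.detach())

x = torch.rand(3072, 1024).half().cuda()

with torch.inference_mode():
    dense_output = linear(x)
    dense_ms = Timer(stmt="linear(x)", globals={"linear": linear, "x": x}).blocked_autorange().median * 1e3

    # swap the dense weight for its semi-structured sparse representation
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

    sparse_output = linear(x)
    sparse_ms = Timer(stmt="linear(x)", globals={"linear": linear, "x": x}).blocked_autorange().median * 1e3

    # sparse and dense matmul are numerically equivalent
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)
    print(f"Dense: {dense_ms:.3f} ms, sparse: {sparse_ms:.3f} ms, speedup: {dense_ms / sparse_ms:.2f}x")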
Nope. I intend to use sparse matmul on the CPU, and to_sparse_semi_structured seems to be designed for GPUs. BTW, after extensive analysis (link), my initial conclusion is: dense matmul is already heavily optimized, so sparse matmul only becomes advantageous when the sparsity ratio is large enough (roughly above 90%).
In this regard, unstructured weight sparsity mainly helps memory usage and does not by itself improve throughput.
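To illustrate that crossover, here is a rough CPU sweep over sparsity ratios (a sketch; the exact break-even point depends on the hardware, the matrix shapes, and the PyTorch build):

import time

import torch

def time_matmul(fn, warmup=3, iters=10):
    """Crude wall-clock timing on CPU: a few warm-up calls, then the average over iters."""
    for _ in range(warmup):
        fn()
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters * 1000  # ms

x = torch.rand(3072, 1024)
for sparsity in (0.5, 0.9, 0.95, 0.99):
    w = torch.rand(3072, 1024)
    # zero out the smallest-magnitude fraction of weights
    thresh = torch.quantile(w.abs().flatten(), sparsity)
    w[w.abs() < thresh] = 0.0
    w_csr = w.to_sparse_csr()
    dense_ms = time_matmul(lambda: x @ w.t())
    sparse_ms = time_matmul(lambda: torch.sparse.mm(w_csr, x.t()).t())
    print(f"sparsity={sparsity:.2f}  dense={dense_ms:.2f} ms  sparse={sparse_ms:.2f} ms")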