
Question about the latency speedup!

Open ybai62868 opened this issue 2 years ago • 4 comments

Hi,

Thanks for the great work! I am curious about whether you will provide the script to get the end-to-end inference latency on a single GPU for the Llama family models?

Thanks, Yang

ybai62868 avatar Oct 26 '23 08:10 ybai62868

I am adding this to my TODO list for this repo. Not sure when I can get back to it. In the meantime, feel free to check out this blog post on end-to-end speedup evaluation of Hugging Face Transformer models with structured sparsity.

Eric-mingjie avatar Oct 27 '23 02:10 Eric-mingjie

Hi @Eric-mingjie. I tried to benchmark the efficiency gain from sparsity. However, I found that sparse matmul seems to be slower than dense matmul.

import time

import torch

sparsity_ratio = 0.5

linear = torch.nn.Linear(1024, 3072, bias=False).float().cpu().eval()

# select the smallest-magnitude 50% of weights in each row for pruning
sort_res = torch.sort(torch.abs(linear.weight), dim=-1, stable=True)
indices = sort_res[1][:, :int(linear.weight.shape[1] * sparsity_ratio)]
mask = (torch.zeros_like(linear.weight) == 1)
mask.scatter_(1, indices, True)
# zero out the pruned weights in place; without this step the CSR tensor
# below would still store every entry as a nonzero
linear.weight.data[mask] = 0.0

x = torch.rand(3072, 1024).float().cpu()

with torch.inference_mode():
    start = time.time()
    dense_output = linear(x)
    print(f"Dense linear {(time.time() - start) * 1000} ms")

    # convert the (now 50% zero) weight to sparse CSR format
    linear.weight = torch.nn.Parameter(linear.weight.to_sparse_csr())

    start = time.time()
    sparse_output = torch.sparse.mm(linear.weight, x.t()).t()
    # sparse_output = linear(x)
    print(f"Sparse linear {(time.time() - start) * 1000} ms")

    # sparse and dense matmul should be numerically equivalent
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)

Running the above code yields the following output:

Dense linear 13.79251480102539 ms
Sparse linear 155.81130981445312 ms

The sparse matmul is about 10x slower. Do you have any idea why this happens?
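As an aside, a single time.time() measurement is quite noisy; torch.utils.benchmark.Timer handles warm-up and repeated runs. A minimal sketch, assuming the linear (already converted to CSR) and x tensors from the snippet above:

from torch.utils.benchmark import Timer

# dense copy of the weight for comparison (linear.weight is CSR at this point)
dense_w = linear.weight.to_dense()

dense_t = Timer(
    stmt="x @ dense_w.t()",
    globals={"x": x, "dense_w": dense_w},
).blocked_autorange()

sparse_t = Timer(
    stmt="torch.sparse.mm(w, x.t()).t()",
    globals={"torch": torch, "w": linear.weight, "x": x},
).blocked_autorange()

print(dense_t)
print(sparse_t)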

llCurious avatar Nov 08 '23 08:11 llCurious

Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer
SparseSemiStructuredTensor._FORCE_CUTLASS = True
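For reference, a minimal sketch of how that tutorial applies 2:4 semi-structured sparsity to a linear layer (shapes and mask construction follow the tutorial; a CUDA GPU with 2:4 sparse kernel support and fp16 weights are assumed):

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
SparseSemiStructuredTensor._FORCE_CUTLASS = True

linear = torch.nn.Linear(10240, 3072, bias=False).half().cuda().eval()
# 2:4 pattern: keep 2 out of every 4 weights along each row
mask = torch.Tensor([0, 0, 1, 1]).tile((3072, 2560)).cuda().bool()
linear.weight = torch.nn.Parameter(mask * linear.weight)
x = torch.rand(3072, 10240).half().cuda()

with torch.inference_mode():
    dense_output = linear(x)
    # swap the dense weight for its semi-structured sparse representation
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))
    sparse_output = linear(x)
    # outputs should match up to fp16 tolerance
    assert torch.allclose(dense_output, sparse_output, atol=1e-2)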

Eric-mingjie avatar Nov 09 '23 21:11 Eric-mingjie

> Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?
>
> import torch
> from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
> from torch.utils.benchmark import Timer
> SparseSemiStructuredTensor._FORCE_CUTLASS = True

Nope. I intend to use sparse matmul on the CPU; to_sparse_semi_structured seems to be designed for GPUs. By the way, after some further analysis (link), my initial conclusion is that dense matmul is already heavily optimized, so sparse matmul only becomes advantageous when the sparsity ratio is large enough (e.g., above 90%).

In this regard, unstructured weight sparsity mainly reduces memory usage and does not by itself improve throughput.
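To illustrate, here is a small sketch (not from the benchmark above) that sweeps the sparsity ratio and compares dense vs. CSR matmul on CPU; where the crossover happens will depend on the hardware and PyTorch build:

import time
import torch

out_f, in_f, batch = 3072, 1024, 3072
x = torch.rand(batch, in_f)

for ratio in (0.5, 0.9, 0.95, 0.99):
    w = torch.rand(out_f, in_f)
    w[torch.rand_like(w) < ratio] = 0.0  # zero out roughly `ratio` of the entries
    w_csr = w.to_sparse_csr()

    start = time.time()
    dense_out = x @ w.t()
    dense_ms = (time.time() - start) * 1000

    start = time.time()
    sparse_out = torch.sparse.mm(w_csr, x.t()).t()
    sparse_ms = (time.time() - start) * 1000

    print(f"sparsity={ratio:.2f}  dense={dense_ms:.2f} ms  sparse={sparse_ms:.2f} ms")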

llCurious avatar Nov 10 '23 02:11 llCurious