
Performance issue: 28.1% less inference time on the demo case with a simple change.

David-Dingle opened this issue 7 months ago

Bug

... This is a report of an observed performance issue. By fixing this, the demo case from the home page:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"

can enjoy 28.1% less total kernel execution time on the Nvidia RTX 3090.

Steps to reproduce

... I installed Docling on Ubuntu 22.04 in a conda virtual environment, following the official guide. The demo case above passes its inference data through class TMTransformerDecoderLayer(nn.TransformerDecoderLayer), located on my machine at /envs/docling/lib/python3.11/site-packages/docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py. Look at lines 96 and 107 (shown below with my profiling instrumentation):

        # From PyTorch but modified to only use the last tag
        tgt_last_tok = tgt[-1:, :, :]
        tmp_tgt = self.self_attn(
            tgt_last_tok,
            tgt,
            tgt,
            attn_mask=None,  # None, because we only care about the last tag
            key_padding_mask=tgt_key_padding_mask,
            is_docling=True,  # not a stock nn.MultiheadAttention argument; added by my local patch
        )[0]  # line 97
        tgt_last_tok = tgt_last_tok + self.dropout1(tmp_tgt)
        tgt_last_tok = self.norm1(tgt_last_tok)
        if memory is not None:
            with proton.scope("transformer_rs"):  # Triton (Proton) profiler scope, added for measurement
                tmp_tgt = self.multihead_attn(
                    tgt_last_tok,
                    memory,
                    memory,
                    attn_mask=memory_mask,
                    key_padding_mask=memory_key_padding_mask,
                    is_docling=True,  # same local-patch flag
                )[0]  # line 107
            tgt_last_tok = tgt_last_tok + self.dropout2(tmp_tgt)
            tgt_last_tok = self.norm2(tgt_last_tok)

In both calls, only the first of the three projected tensors is actually used, which means that 4 out of the 6 projection computations are wasted.

Getting to the root cause: the code above ends up invoking PyTorch's torch.nn.functional._in_projection_packed, but only uses q_proj and discards kv_proj[0] and kv_proj[1].
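
For reference, the branch of _in_projection_packed that both attention calls above go through (q is not k, but k is v) looks roughly like the sketch below. This is a paraphrase for illustration, not a verbatim copy of torch/nn/functional.py, the helper name is made up, and exact line numbers differ across PyTorch versions.

# Paraphrased sketch of the encoder-decoder branch inside
# torch.nn.functional._in_projection_packed (helper name is illustrative only).
# The last two returned tensors are exactly the values that go unused on this path.
from typing import Optional

from torch import Tensor
from torch.nn.functional import linear

def in_projection_packed_encdec_branch(
    q: Tensor, k: Tensor, w: Tensor, b: Optional[Tensor] = None
):
    E = q.size(-1)
    w_q, w_kv = w.split([E, E * 2])
    b_q, b_kv = (None, None) if b is None else b.split([E, E * 2])
    q_proj = linear(q, w_q, b_q)        # the only projection used downstream (per this report)
    kv_proj = linear(k, w_kv, b_kv)     # one extra linear projection ...
    kv_proj = (
        kv_proj.unflatten(-1, (2, E))
        .unsqueeze(0)
        .transpose(0, -2)
        .squeeze(-2)
        .contiguous()                   # ... plus one extra contiguous() copy
    )
    return q_proj, kv_proj[0], kv_proj[1]  # kv_proj[0] / kv_proj[1] are the discarded values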

To speed this up, you can create your own function: essentially copy the PyTorch code and remove lines 5726 to 5735. By doing so, you save the time of one nn.Linear call and one contiguous() copy operation:

from typing import Optional

from torch import Tensor
from torch.nn.functional import linear


def docling_in_projection_packed(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    w: Tensor,
    b: Optional[Tensor] = None,
) -> list[Tensor]:
    E = q.size(-1)
    if k is v:
        if q is k:
            # self-attention
            proj = linear(q, w, b)
            # reshape to 3, E and not E, 3 is deliberate for better memory coalescing and keeping same order as chunk()
            proj = (
                proj.unflatten(-1, (3, E))
                .unsqueeze(0)
                .transpose(0, -2)
                .squeeze(-2)
                .contiguous()
            )
            return proj[0], proj[1], proj[2]
        else:
            # encoder-decoder attention
            w_q, w_kv = w.split([E, E * 2])
            if b is None:
                b_q = b_kv = None
            else:
                b_q, b_kv = b.split([E, E * 2])
            q_proj = linear(q, w_q, b_q)
            # CODE REMOVAL: the kv_proj computation (PyTorch's lines 5726-5735) is deleted here,
            # which skips one nn.Linear call and one contiguous() copy
            return (q_proj, None, None)
    else:
        w_q, w_k, w_v = w.chunk(3)
        if b is None:
            b_q = b_k = b_v = None
        else:
            b_q, b_k, b_v = b.chunk(3)
        return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
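
As a quick, isolated sanity check (a sketch, not the measurement behind the 28.1% number), you can time the stock helper against the trimmed one directly. The embedding size and sequence length below are made-up placeholders, not values taken from the tableformer config, and a CUDA device is assumed:

# Micro-benchmark sketch: compares torch.nn.functional._in_projection_packed with the
# trimmed docling_in_projection_packed defined above, on shapes loosely resembling the
# cross-attention call (one query token against a longer memory). Shapes are hypothetical.
import torch
import torch.nn.functional as F

E, B, S = 512, 1, 784                       # embed dim, batch, memory length (assumed)
q = torch.randn(1, B, E, device="cuda")     # a single "last tag" query token
k = v = torch.randn(S, B, E, device="cuda")
w = torch.randn(3 * E, E, device="cuda")    # packed in-projection weight
b = torch.randn(3 * E, device="cuda")

def bench(fn, iters=200):
    for _ in range(10):                     # warm-up
        fn(q, k, v, w, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(q, k, v, w, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call

print("stock _in_projection_packed :", bench(F._in_projection_packed), "ms")
print("trimmed (q_proj only)       :", bench(docling_in_projection_packed), "ms")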

Docling version

... commit: bfcab3d6778e6f622bb4a6b241bdb4bab22ba378

Python version

... python=3.11.9

David-Dingle commented on May 05 '25

@David-Dingle Very interesting findings! We will have a look into it as soon as we can.

cau-git commented on May 21 '25

Dear @cau-git @maxmnemonic or @nikos-livathinos,

Any follow-up? This one looks very interesting to me.

pengfei-su commented on Jun 10 '25

@David-Dingle can you explain how you achieved the 28% less inference time? How did you measure and on which hardware?

If you could post a PR to docling-ibm-models with your proposed change applied, I would be happy to review.

Thanks.

cau-git commented on Jun 18 '25

@cau-git Hello, Christoph.

The kernel execution time savings were measured with the Triton profiler (Proton) on an Nvidia RTX 3090.
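
Roughly, the measurement setup looks like the sketch below; the session name and proton-viewer invocation are illustrative and may differ across Triton versions (only the proton.scope("transformer_rs") region comes from the instrumented code shown earlier).

# Rough sketch, not the exact script: wrap the demo conversion in a Proton profiling
# session so the proton.scope("transformer_rs") region shows up in the kernel-time
# breakdown. Session name and viewer flags below are assumptions.
import triton.profiler as proton

from docling.document_converter import DocumentConverter

proton.start("docling_tableformer")   # profile data is written when finalize() is called
converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")
proton.finalize()

# Inspect kernel execution times afterwards, e.g.:
#   proton-viewer -m time/ms docling_tableformer.hatchet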

I observed that the above-mentioned code block only utilizes 1/3 of the return values from the built-in PyTorch nn function.

One way to achieve the speed up without modifying the module architecture is to imitate the _in_projection_packed function and create your own. In the new function, you can remove all computations related to "kv_proj".

Thank you for your attention. I have enjoyed Docling so much~

David-Dingle commented on Jun 20 '25

> @David-Dingle can you explain how you achieved the 28% less inference time? How did you measure and on which hardware?
>
> If you could post a PR to docling-ibm-models with your proposed change applied, I would be happy to review.
>
> Thanks.

@cau-git Hi Christoph. Is it possible to see the performance difference on your side, by removing the code at https://github.com/pytorch/pytorch/blob/fdadda21b6ca88eede54930ae58278cd1f67e944/torch/nn/functional.py#L5736?

David-Dingle commented on Jul 17 '25