Performance issue: 28.1% less inference time on the demo case with a simple change.
Bug
This is a report of an observed performance issue. By fixing it, the demo case from the home page:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
can enjoy 28.1% less total kernel execution time on the Nvidia RTX 3090.
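For a quick sanity check, one could also wrap the demo in a simple wall-clock timer (only a rough proxy; the 28.1% figure refers to GPU kernel execution time measured with a profiler, see below):
import time

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()

start = time.perf_counter()
result = converter.convert(source)
print(f"convert() took {time.perf_counter() - start:.2f} s")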
Steps to reproduce
I installed Docling on Ubuntu 22.04 in a conda virtual environment, following the official guide. The case above passes inference data through class TMTransformerDecoderLayer(nn.TransformerDecoderLayer), located on my machine at /envs/docling/lib/python3.11/site-packages/docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py. Go down to lines 96 and 107:
# From PyTorch but modified to only use the last tag
tgt_last_tok = tgt[-1:, :, :]
tmp_tgt = self.self_attn(
    tgt_last_tok,
    tgt,
    tgt,
    attn_mask=None,  # None, because we only care about the last tag
    key_padding_mask=tgt_key_padding_mask,
    is_docling=True,
)[0]  # line 97
tgt_last_tok = tgt_last_tok + self.dropout1(tmp_tgt)
tgt_last_tok = self.norm1(tgt_last_tok)
if memory is not None:
    with proton.scope("transformer_rs"):
        tmp_tgt = self.multihead_attn(
            tgt_last_tok,
            memory,
            memory,
            attn_mask=memory_mask,
            key_padding_mask=memory_key_padding_mask,
            is_docling=True,
        )[0]  # line 107
        tgt_last_tok = tgt_last_tok + self.dropout2(tmp_tgt)
        tgt_last_tok = self.norm2(tgt_last_tok)
They only use the first one of the three return values, which means that 4 out of 6 computations are wasted.
Talking about the root cause: the code above invokes PyTorch's torch.nn.functional._in_projection_packed, but only uses q_proj and discards kv_proj[0] and kv_proj[1].
To speed this up, you can create your own function: basically copy the PyTorch code and comment out lines 5726 to 5735. By doing so, you save the time of one nn.Linear call and one contiguous() copy operation.
from typing import Optional

from torch import Tensor
from torch.nn.functional import linear


def docling_in_projection_packed(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    w: Tensor,
    b: Optional[Tensor] = None,
) -> list[Tensor]:
    E = q.size(-1)
    if k is v:
        if q is k:
            # self-attention
            proj = linear(q, w, b)
            # reshape to 3, E and not E, 3 is deliberate for better memory
            # coalescing and keeping same order as chunk()
            proj = (
                proj.unflatten(-1, (3, E))
                .unsqueeze(0)
                .transpose(0, -2)
                .squeeze(-2)
                .contiguous()
            )
            return proj[0], proj[1], proj[2]
        else:
            # encoder-decoder attention
            w_q, w_kv = w.split([E, E * 2])
            if b is None:
                b_q = b_kv = None
            else:
                b_q, b_kv = b.split([E, E * 2])
            q_proj = linear(q, w_q, b_q)
            # kv_proj computation removed here (PyTorch lines 5726-5735)
            return (q_proj, None, None)
    else:
        w_q, w_k, w_v = w.chunk(3)
        if b is None:
            b_q = b_k = b_v = None
        else:
            b_q, b_k, b_v = b.chunk(3)
        return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
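To illustrate where the saving comes from, here is a small standalone sketch (not Docling code; the shapes, dtype, and CUDA device are arbitrary assumptions) that times the full packed encoder-decoder projection against the q-only variant above:
import torch
from torch.nn.functional import linear

# Illustrative shapes only: embed dim, target length (last tag), source length, batch.
E, L, S, B = 512, 1, 256, 4
q = torch.randn(L, B, E, device="cuda")   # decoder query (last tag only)
kv = torch.randn(S, B, E, device="cuda")  # encoder memory
w = torch.randn(3 * E, E, device="cuda")  # packed in-projection weight
b = torch.randn(3 * E, device="cuda")     # packed in-projection bias

def full_projection():
    # Mirrors the encoder-decoder branch of _in_projection_packed:
    # projects q, then k and v, including the contiguous() copy.
    w_q, w_kv = w.split([E, E * 2])
    b_q, b_kv = b.split([E, E * 2])
    q_proj = linear(q, w_q, b_q)
    kv_proj = linear(kv, w_kv, b_kv)
    kv_proj = (
        kv_proj.unflatten(-1, (2, E))
        .unsqueeze(0)
        .transpose(0, -2)
        .squeeze(-2)
        .contiguous()
    )
    return q_proj, kv_proj[0], kv_proj[1]

def q_only_projection():
    # k is v and q is not k, so this hits the encoder-decoder branch above.
    return docling_in_projection_packed(q, kv, kv, w, b)

def time_it(fn, iters=100):
    # CUDA-event timing; on CPU, use time.perf_counter instead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print(f"full packed projection: {time_it(full_projection):.4f} ms")
print(f"q-only projection:      {time_it(q_only_projection):.4f} ms")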
Docling version
commit: bfcab3d6778e6f622bb4a6b241bdb4bab22ba378
Python version
python=3.11.9
@David-Dingle Very interesting findings! We will have a look into it as soon as we can.
Dear @cau-git @maxmnemonic or @nikos-livathinos,
Any follow-up? This one looks very interesting to me.
@David-Dingle can you explain how you achieved the 28% less inference time? How did you measure and on which hardware?
If you could post a PR to docling-ibm-models with your proposed change applied, I would be happy to review.
Thanks.
@cau-git Hello, Christoph.
The saved kernel execution time was measured with the Triton profiler on an Nvidia RTX 3090.
I observed that the above-mentioned code block only utilizes 1/3 of the return values from the built-in PyTorch nn function.
One way to achieve the speed-up without modifying the module architecture is to imitate the _in_projection_packed function and create your own. In the new function, you can remove all computations related to kv_proj.
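In case it helps, here is a minimal sketch of how such a kernel-time measurement can be set up with Triton's Proton profiler (assuming a Triton build that ships it as triton.profiler; the session and scope names are arbitrary):
import triton.profiler as proton

from docling.document_converter import DocumentConverter

proton.start("docling_demo")  # profile is written to docling_demo.hatchet on finalize
with proton.scope("convert"):
    converter = DocumentConverter()
    result = converter.convert("https://arxiv.org/pdf/2408.09869")
proton.finalize()
# Inspect kernel times with, e.g.: proton-viewer -m time/ms docling_demo.hatchet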
Thank you for your attention. I have enjoyed Docling so much~
@cau-git Hi Christoph. Is it possible to reproduce the performance difference on your side (by removing the code at https://github.com/pytorch/pytorch/blob/fdadda21b6ca88eede54930ae58278cd1f67e944/torch/nn/functional.py#L5736)?