MS-Word support integral sign
Bug
Traceback (most recent call last):
  File "/Users/bytedance/rag/opensource/docling/docling/pipeline/base_pipeline.py", line 72, in execute
    conv_res = self._build_document(conv_res)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/pipeline/simple_pipeline.py", line 40, in _build_document
    conv_res.document = conv_res.input._backend.convert()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/msword_backend.py", line 153, in convert
    doc, _ = self._walk_linear(self.docx_obj.element.body, doc)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/msword_backend.py", line 345, in _walk_linear
    te = self._handle_text_elements(element, doc)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/msword_backend.py", line 889, in _handle_text_elements
    text, equations = self._handle_equations_in_text(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/msword_backend.py", line 829, in _handle_equations_in_text
    latex_equation = str(oMath2Latex(subt)).strip()
                         ^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/docx/latex/omml.py", line 189, in __init__
    self._latex = self.process_children(element)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/docx/latex/omml.py", line 120, in process_children
    return BLANK.join(
           ^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/docx/latex/omml.py", line 121, in <genexpr>
    (
  File "/Users/bytedance/rag/opensource/docling/docling/backend/docx/latex/omml.py", line 100, in process_children_list
    t = self.call_method(_e, stag=stag)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/docx/latex/omml.py", line 86, in call_method
    return method(self, elm)
           ^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/rag/opensource/docling/docling/backend/docx/latex/omml.py", line 387, in do_nary
    return bo + BLANK.join(res)
           ~~~^~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Steps to reproduce
Run DocumentConverter().convert(file) on the file integral-sign.docx, which contains only the formula $\int_{b}^{a}x$.
Alternatively, follow this tutorial to generate an MS Word file containing an integral formula.
Docling version
The latest main branch, as of the commit on Nov 27, 2025.
Python version
3.11.12
The TypeError occurs because do_nary, which handles the OMML-to-LaTeX conversion of n-ary operators such as integrals, tries to concatenate None (the integral symbol, bo) with a string. This happens when the OMML element for the integral lacks the expected chr attribute, so get_val(t.chr, store=CHR_BO) returns None instead of a string. do_nary has no safeguard for this case, so Python raises the error when it evaluates bo + BLANK.join(res) with bo set to None (citation).
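The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Docling's actual code: CHR_BO is a one-entry stand-in dictionary, and get_val and do_nary_sketch are simplified, hypothetical versions of the helpers named in the traceback, written under the assumption that the lookup returns None when no chr value is present.

```python
# Stand-in for the operator table; the real CHR_BO in Docling has more entries.
CHR_BO = {"\u222b": "\\int"}
BLANK = ""


def get_val(key, store):
    """Hypothetical, simplified lookup: returns None when no key is given."""
    if key is None:
        return None
    return store.get(key, key)


def do_nary_sketch(chr_value, parts):
    """Mimic the concatenation that fails in the traceback."""
    bo = get_val(chr_value, store=CHR_BO)
    return bo + BLANK.join(parts)  # raises TypeError when bo is None


def try_convert(chr_value, parts):
    """Return the LaTeX string, or the error message if conversion fails."""
    try:
        return do_nary_sketch(chr_value, parts)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_convert("\u222b", ["_{b}^{a}", "x"]))  # chr present
print(try_convert(None, ["_{b}^{a}", "x"]))      # chr missing
```

With the chr value present the sketch yields the expected LaTeX; with it missing, the same TypeError as in the report is raised.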
To fix this, update do_nary to default bo to an empty string if get_val returns None, and filter out any None values in res before joining. For example:
bo = get_val(t.chr, store=CHR_BO) or ""
...
return bo + BLANK.join([x for x in res if x is not None])
As a quick workaround, you can also wrap the LaTeX conversion call in a try/except block where equations are processed (e.g., in _handle_equations_in_text), so that if an exception occurs, a placeholder such as <formula skipped> is inserted and processing continues (citation).
This bug is known and has appeared in other equation conversion contexts in Docling (citation), so making the code more robust to missing or malformed OMML attributes is recommended.
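The try/except workaround could look roughly like the sketch below. The names safe_latex and PLACEHOLDER are hypothetical and do not exist in Docling; convert stands for whatever callable performs the OMML-to-LaTeX conversion (oMath2Latex in the traceback). Note that this only keeps document conversion running; the formula itself is still lost, so it is a stopgap rather than a fix.

```python
import logging

_log = logging.getLogger(__name__)

# Hypothetical placeholder text; pick whatever marker suits the output format.
PLACEHOLDER = "<formula skipped>"


def safe_latex(convert, element):
    """Run convert(element) and fall back to a placeholder if it raises.

    `convert` is any callable that turns an equation element into an object
    whose str() is the LaTeX source.
    """
    try:
        return str(convert(element)).strip()
    except Exception as exc:  # deliberately broad: any converter failure
        _log.warning("Equation conversion failed: %s", exc)
        return PLACEHOLDER
```

Logging the exception rather than silently swallowing it keeps the failing equations discoverable after a batch conversion.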
To fix this, update do_nary to default bo to an empty string if get_val returns None, and filter out any None values in res before joining. For example:
Bypassing the exception isn't the goal. Replacing bo with an empty string produces LaTeX with the integral sign missing, _{b}^{a}x, whereas the expected output is \int_{b}^{a}x.
You're right: defaulting to an empty string just hides the error and drops the integral sign, which is not correct. The root cause is that the OMML parser expects the naryPr/chr attribute to match a Unicode codepoint in the CHR_BO dictionary (like \u222b for \int), but Word can emit other variants (double, triple, contour, etc.) that aren't mapped. When that happens, get_val returns None, so the LaTeX output loses the integral symbol and the TypeError is triggered when None is concatenated with a string.
A robust fix is to expand CHR_BO in latex_dict.py to include all relevant Unicode codepoints for integrals and related big operators that Word might use (e.g., double, triple, and contour integrals). In do_nary, if get_val returns None, you should try to infer the correct LaTeX symbol, either by falling back to \int if the context is an integral or by using the literal Unicode symbol as a last resort. This way, you avoid both the error and the missing symbol, and you get the expected LaTeX, \int_{b}^{a}x, instead of just _{b}^{a}x (see source).
Filtering out None values in res is still good practice, but the key is to ensure that bo always gets a valid LaTeX operator string. This approach makes the conversion robust to OMML variants and preserves the integral sign as expected (see source).
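The fallback strategy can be sketched with the standard library alone. Everything below is illustrative rather than Docling's implementation: NARY_LATEX, nary_operator, text_of, and nary_to_latex are hypothetical names, and the table is not a copy of CHR_BO. The sketch also assumes, based on my reading of the OOXML math schema, that an m:nary element without an m:chr child denotes the default operator, the integral, which would match a document containing only a plain integral.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Illustrative operator table (hypothetical, not Docling's CHR_BO).
NARY_LATEX = {
    "\u222b": "\\int",
    "\u222c": "\\iint",
    "\u222d": "\\iiint",
    "\u222e": "\\oint",
    "\u2211": "\\sum",
    "\u220f": "\\prod",
}


def nary_operator(nary):
    """Return the LaTeX operator for an <m:nary> element."""
    chr_el = nary.find(f"{{{M_NS}}}naryPr/{{{M_NS}}}chr")
    if chr_el is None:
        # Assumption: a missing <m:chr> means the default operator, the integral.
        return "\\int"
    symbol = chr_el.get(f"{{{M_NS}}}val")
    # Last resort: keep the literal Unicode symbol instead of dropping it.
    return NARY_LATEX.get(symbol, symbol or "\\int")


def text_of(parent, tag):
    """Concatenate all text below the given child element, if it exists."""
    el = parent.find(f"{{{M_NS}}}{tag}")
    return "" if el is None else "".join(el.itertext()).strip()


def nary_to_latex(nary):
    """Convert a simple <m:nary> (operator, limits, body) to LaTeX."""
    op = nary_operator(nary)
    sub, sup, body = text_of(nary, "sub"), text_of(nary, "sup"), text_of(nary, "e")
    return f"{op}_{{{sub}}}^{{{sup}}}{body}"


# An integral with no <m:chr> element, matching the assumed default case.
SAMPLE = f"""
<m:nary xmlns:m="{M_NS}">
  <m:naryPr><m:limLoc m:val="subSup"/></m:naryPr>
  <m:sub><m:r><m:t>b</m:t></m:r></m:sub>
  <m:sup><m:r><m:t>a</m:t></m:r></m:sup>
  <m:e><m:r><m:t>x</m:t></m:r></m:e>
</m:nary>
"""

print(nary_to_latex(ET.fromstring(SAMPLE.strip())))
```

The operator lookup is kept separate from the string assembly so that an unmapped symbol degrades to its literal Unicode character instead of raising or silently disappearing.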
I found a Python module, officemath2latex, that can handle these cases. You can review the project and then decide whether to replace docling/backend/docx/latex/omml.py with it.
I can submit a PR if you are willing to accept it.
@ealyn, thanks for spotting this issue and for your willingness to provide a solution.
I found a Python module, officemath2latex, that can handle these cases. You can review the project and then decide whether to replace docling/backend/docx/latex/omml.py with it.
I don't think it's a good idea. Docling is an open-source project from the LF AI & Data Foundation, and adding dependencies must be done with caution. The library you propose lacks an explicit open-source license, has no robust release history, has no CI/CD pipeline, and shows limited adoption. These gaps make it risky to maintain such a dependency.