TexSoup
Non-matching brackets not parsed correctly
Certain math notation involves non-matched brackets.
For example, the set of nonnegative numbers is denoted $[0, \infty)$ in interval notation. TexSoup handles this notation fine on its own but has trouble if there is a command before the non-matching expression, e.g. $S \cup [0, \infty)$.
tokenize(categorize(r"""$\cup [0, \infty)$"""))
GIVES
tokens= '$'_MathSwitch
'\'_Escape
'cup'_CommandName
' '_MergedSpacer
'['_BracketBegin
'0, '_Text
'\'_Escape
'infty'_CommandName
')'_Text
'$'_MathSwitch
I'm thinking it sees the command and then the _BracketBegin, so it starts looking for a closing bracket, treating what follows as optional arguments for the command \cup.
Here is the minimal failing test case:
from TexSoup import TexSoup


def test_mixed_brackets():
    """Tests handling of math with non-matching bracket after a tex command."""
    soup = TexSoup(r"""$(-\infty,0]$""")  # works fine
    soup = TexSoup(r"""$[0, \infty)$""")  # works fine
    # GH115
    soup = TexSoup(r"""$S \cup [0, \infty)$""")  # fails as of this issue
    assert True
Oof yeah you're right.
The problem is that whitespace can appear between a command and its arguments, so the [ after \cup looks like the start of an optional argument. Unfortunately, TexSoup just needs to know how many args to expect, so the temporary solution is to add an entry to this dictionary (something like cup: (0, 0) for 0 required, 0 optional args):
https://github.com/alvinwan/TexSoup/blob/51334866afa5033b3b6c6408ec2a8c5d69c32abe/TexSoup/reader.py#L29
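To illustrate why a signature entry helps, here is a toy model of the idea. It is not TexSoup's actual reader: `TOY_SIGNATURES`, `read_optional_args`, and the token list format are all hypothetical, made up for this sketch. It only shows how a (required, optional) count can stop the reader from treating a stray `[` as an optional argument.

```python
# Toy model only: NOT TexSoup's actual reader. All names here are hypothetical.
TOY_SIGNATURES = {"cup": (0, 0), "section": (1, 1)}  # name -> (required, optional)


def read_optional_args(tokens, start, name, signatures):
    """Collect '[...]' groups that follow command `name` at tokens[start:].

    Returns (args, next_index). A known signature caps how many optional
    groups are consumed; an unknown command is read greedily.
    """
    limit = signatures[name][1] if name in signatures else None
    args, i = [], start
    while limit is None or len(args) < limit:
        # Skip whitespace between the command and a possible argument.
        j = i
        while j < len(tokens) and tokens[j].isspace():
            j += 1
        if j >= len(tokens) or tokens[j] != "[":
            break
        if "]" not in tokens[j:]:
            raise ValueError("could not find matching punctuation for: [")
        close = tokens.index("]", j)
        args.append(tokens[j + 1:close])
        i = close + 1
    return args, i


# Tokens that follow \cup in $\cup [0, \infty)$ (simplified).
after_cup = [" ", "[", "0, ", "\\infty", ")"]
args, next_index = read_optional_args(after_cup, 0, "cup", TOY_SIGNATURES)
```

With the `cup: (0, 0)` entry, the toy reader consumes nothing and leaves the `[` to be handled as ordinary text. Passing an empty dictionary instead makes it go looking for a `]` and raise, which mirrors the failure described above.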
I know cup is just an example; for a longer term solution, I was thinking of taking lists of operators/commands from lists like this one (bottom of the page) https://www.overleaf.com/learn/latex/Operators, and writing to some .conf or .yaml files that TexSoup comes prepackaged with. Thoughts? Was gonna do this in nov ish, after my next paper deadline.
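A rough sketch of what the prepackaged-file idea could look like, using the standard library's `configparser` for the .conf variant. The section name `signatures`, the "required optional" value format, and the `load_signatures` helper are assumptions for illustration, not anything TexSoup ships. The \left and \right entries use the (1, 0) guess that comes up later in this thread.

```python
import configparser

# Hypothetical file contents; the section name and the "required optional"
# value format are made up for this sketch.
EXAMPLE_CONF = """
[signatures]
cup = 0 0
left = 1 0
right = 1 0
Gamma = 0 0
"""


def load_signatures(conf_text):
    """Parse 'name = required optional' lines into {name: (required, optional)}."""
    parser = configparser.ConfigParser()
    # LaTeX command names are case-sensitive, so keep keys as written
    # (configparser lowercases them by default).
    parser.optionxform = str
    parser.read_string(conf_text)
    signatures = {}
    for name, value in parser["signatures"].items():
        required, optional = (int(part) for part in value.split())
        signatures[name] = (required, optional)
    return signatures


signatures = load_signatures(EXAMPLE_CONF)
```

The resulting dictionary has the same name -> (required, optional) shape as the entries discussed above, so in principle it could be merged into the existing table. Overriding `optionxform` matters here because commands like \Gamma and \gamma are distinct.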
Yeah the SIGNATURES approach seems like the way to go.
Here are some source code repos that might be a good place to get some signatures from macros:
- MathJax https://github.com/mathjax/MathJax-src/blob/master/ts/input/tex/base/BaseMappings.ts
- plasTeX https://github.com/plastex/plastex/blob/master/plasTeX/Base/LaTeX/Math.py
- specific for (0,0) signatures https://github.com/KaTeX/KaTeX/blob/master/src/symbols.js
No rush to fix this --- I found a workaround for the specific issue by rewriting as $S$ $\cup$ $[0, \infty)$ and it works.
Awesome, thanks for the second opinion. And siiick, thanks so much for digging those up. 🙇
For anyone else looking at this thread, I'll make sure to reference this issue when the PR is created.
The suggested workaround (adding cup: (0, 0) to SIGNATURES) does not seem to work inside an equation environment:
TexSoup(r"""$ \cup [0, \infty)$""") # works fine
TexSoup(r""" \begin{equation} \cup [0, \infty) \end{equation}""") # fails
The exception TexSoup gives is:
<...>
TypeError: [Line: 0, Offset 23] Malformed argument. First and last elements must match a valid argument format. In this case, TexSoup could not find matching punctuation for: [.
Just finished parsing: ['[', '0, ', TexCmd('infty'), ') ', TexCmd('end', [BraceGroup('equation')])]
\left and \right also need to be added to SIGNATURES. I think both are (1,0).
A further example of non-matching brackets not parsing correctly:
'\\( [ \\infty [ \\)' fails ("TexSoup could not find matching punctuation for: [.")
Whereas
'\\( [ a [ \\)' and '[\\( \\left[ \\infty \\right[ \\) ]' are parsed correctly.
Note that in French, notation such as '[0,1[' for a half-open interval is common.
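For anyone who wants to spot inputs of this kind before handing them to a parser, here is a small sketch of a square-bracket check. It is a standalone helper, not part of TexSoup, and the name `find_unmatched_brackets` is hypothetical. It only looks at `[` and `]`, since those are the characters the error messages in this thread complain about.

```python
def find_unmatched_brackets(text):
    """Return sorted (index, char) pairs for square brackets with no partner."""
    open_positions = []  # indices of '[' still waiting for a ']'
    unmatched = []
    for index, char in enumerate(text):
        if char == "[":
            open_positions.append(index)
        elif char == "]":
            if open_positions:
                open_positions.pop()  # pairs with the most recent '['
            else:
                unmatched.append((index, char))
    # Any '[' left on the stack never found a closing ']'.
    unmatched.extend((index, "[") for index in open_positions)
    return sorted(unmatched)


half_open = find_unmatched_brackets(r"$S \cup [0, \infty)$")
french = find_unmatched_brackets("[0,1[")
balanced = find_unmatched_brackets(r"\section[short]{long}")
```

Both the `[0, \infty)` form and the French `[0,1[` form come back with at least one unmatched `[`, while the balanced `\section[short]{long}` returns an empty list. A check like this cannot tell interval notation from a genuinely malformed argument; it only flags where a bracket-matching reader would get stuck.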