pylatexenc
pylatexenc copied to clipboard
Incorrect (?) parsing of content in `lstlisting` environment
Hello, I have though a little bit more about the issue I described in https://github.com/phfaist/pylatexenc/issues/89. While I definitely understand and respect your reasoning I have come to the conclusion that the issue is not related to nodelist_to_latex()
at all.
Consider the following example:
#!/usr/bin/env python3
from pylatexenc.latexwalker import LatexWalker
w = LatexWalker(
r"""
\begin{lstlisting}
$ some content that should not be interpreted as tex!
\end{lstlisting}
""",
tolerant_parsing=False,
)
(nodelist, pos, len_) = w.get_latex_nodes(pos=0)
print(nodelist)
It is important to note the presence of tolerant_parsing=False
. Executing this code leads to the following error message:
Traceback (most recent call last):
File "/tmp/mwe/./mwe.py", line 13, in <module>
(nodelist, pos, len_) = w.get_latex_nodes(pos=0)
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2376, in get_latex_nodes
r_endnow = do_read(nodelist, p)
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2245, in do_read
(envnode, epos, elen) = self.get_latex_environment(
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 1834, in get_latex_environment
(nodelist, npos, nlen) = self.get_latex_nodes(pos,
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2376, in get_latex_nodes
r_endnow = do_read(nodelist, p)
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2187, in do_read
(mathinline_nodelist, mpos, mlen) = self.get_latex_nodes(
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2376, in get_latex_nodes
r_endnow = do_read(nodelist, p)
File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2121, in do_read
raise LatexWalkerParseError(
pylatexenc.latexwalker.LatexWalkerParseError: Unexpected closing environment: 'lstlisting' @(4,0)
Open LaTeX blocks:
@(2,0) begin environment "lstlisting"
@(3,0) math mode "$"
If tolerant_parsing
is set to True
, the script will successfully complete execution and return the following output:
[LatexCharsNode(parsing_state=<parsing state 140426563620000>, pos=0, len=1, chars='\n'),
LatexEnvironmentNode(parsing_state=<parsing state 140426563620000>, pos=1, len=90, environmentname='lstlisting', nodelist=
[LatexCharsNode(parsing_state=<parsing state 140426563620000>, pos=19, len=1, chars='\n'), LatexMathNode(parsing_state=
<parsing state 140426563620000>, pos=20, len=71, displaytype='inline', nodelist=[LatexCharsNode(parsing_state=<parsing state
140426563621104>, pos=21, len=53, chars=' some content that should not be interpreted as tex!\n'),
LatexCharsNode(parsing_state=<parsing state 140426563621104>, pos=90, len=1, chars='\n')], delimiters=('$', '$'))],
nodeargd=ParsedMacroArgs(argspec='', argnlist=[]))]
Here, it can be seen that a LatexMathNode
is created inside of the lstlisting
environment. In my opinion, this is not correct behavior since content inside of the lstlisting
environment should be treated as verbatim (one big LatexCharsNode
(?)). The construction of the unbalanced LatexMathNode
is what eventually causes nodelist_to_latex()
to produce invalid latex code but it is, in my opinion, not the root of the problem. Similarly, in the second example in https://github.com/phfaist/pylatexenc/issues/89 the sequence \"
gets turned into a LatexMacroNode
.
Substituting the lstlisting
environment for a verbatim environment instead leads to correct results.
Again, thank you very much for your time!
Digging a little bit into the source it appears that there is currently no special handling for the lstlisting
environment as there is for verbatim
. Should VerbatimArgsParser
be adapted to handle more environments in addition to verbatim
?
Edit: A quick change which causes lstlisting
to be handled in a similar way as the verbatim
environment:
From 94780a144e1f79ec17bfede62e0f3b10af3e65c9 Mon Sep 17 00:00:00 2001
From: David Brown <[email protected]>
Date: Fri, 14 Oct 2022 09:54:54 +0200
Subject: [PATCH] Handle lstlisting as verbatim environment
---
.../pylatexenc/latexwalker/_defaultspecs.py | 17 ++++++++++++++
.../pylatexenc/macrospec/_argparsers.py | 23 +++++++++++++++++++
2 files changed, 40 insertions(+)
diff --git a/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py b/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py
index 3306a89..a8f6e24 100644
--- a/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py
+++ b/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py
@@ -308,6 +308,23 @@ specs = [
# for verbatim |\like \this|...
]}),
+
+ #
+ # CATEGORY: listings
+ #
+ ('listings', {
+ 'macros': [
+ # TODO: Maybe handle lstinline as well?
+ ],
+ 'environments': [
+ EnvironmentSpec('lstlisting',
+ args_parser=VerbatimArgsParser(verbatim_arg_type='lstlisting-environment')),
+ ],
+ 'specials': [
+ # optionally users could include the specials "|" like in latex-doc
+ # for verbatim |\like \this|...
+ ]}),
+
#
# CATEGORY: theorems
#
diff --git a/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py b/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py
index 5d06724..39b311a 100644
--- a/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py
+++ b/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py
@@ -451,6 +451,29 @@ class VerbatimArgsParser(MacroStandardArgsParser):
)
return (argd, pos, len_)
+ if self.verbatim_arg_type == 'lstlisting-environment':
+ # simply scan the string until we find '\end{lstlisting}'. That's
+ # exactly how LaTeX processes it.
+ endverbpos = w.s.find(r'\end{lstlisting}', pos)
+ if endverbpos == -1:
+ raise latexwalker.LatexWalkerParseError(
+ s=w.s,
+ pos=pos,
+ msg=r"Cannot find matching \end{lstlisting}"
+ )
+ # do NOT include the "\end{lstlisting}", latexwalker will expect to
+ # see it:
+ len_ = endverbpos-pos
+
+ argd = ParsedVerbatimArgs(
+ verbatim_chars_node=w.make_node(latexwalker.LatexCharsNode,
+ parsing_state=parsing_state,
+ chars=w.s[pos:pos+len_],
+ pos=pos,
+ len=len_)
+ )
+ return (argd, pos, len_)
+
if self.verbatim_arg_type == 'verb-macro':
# read the next nonwhitespace char. This is the delimiter of the
# argument
--
2.37.2
I am not sure if this is the correct way to implement this. It does, however, now lead to the correct behavior and has the nice side effect that nodelist_to_latex()
returns the expected content :)
Note: This does not handle other macros or environments provided by the listings
package (e.g. lstinline
).
Thanks for the careful analysis! I'll have a closer look at some point soon.
An update here — the lstlisting
environment is now recognized correctly in pylatexenc
version 3.0alpha
. I am still likely to update the way the contents is reported though. I'm aiming to have verbatim environments report their contents with their environment body nodelist containing a single LatexCharsNode with the verbatim content (which might include characters that could have a special meaning in LaTeX). This would take advantage of the new parsing features of pylatexenc 3
.