`takeTill` performs exponential backtracking in environments
parseLaTeX performs exponential amounts of work in the number of environments. I have two files, book2.txt and book3.txt which are roughly the same size in bytes, but the latter has ~3x as many environments. book2.txt parses quickly, but I haven't been patient enough to get book3.txt to.
The shiatsu program below is doing nothing more than reading the file and calling parseLaTeX.
➜ /tmp/ebook$ wc book2.txt
6347 33541 245180 book2.txt
➜ /tmp/ebook$ cat book2.txt | grep "begin" | wc -l
232
➜ /tmp/ebook$ time shiatsu +RTS -p -RTS book2.txt - &> /dev/null
shiatsu +RTS -p -RTS book2.txt - &> /dev/null 2.23s user 0.10s system 99% cpu 2.343 total
➜ /tmp/ebook$ wc book3.txt
8385 41123 273877 book3.txt
➜ /tmp/ebook$ cat book3.txt | grep "begin" | wc -l
583
➜ /tmp/ebook$ time shiatsu +RTS -p -RTS book3.txt - &> /dev/null
^C
shiatsu +RTS -p -RTS book3.txt - &> /dev/null 43.29s user 0.33s system 99% cpu 43.780 total
Here's the profile information
COST CENTRE MODULE SRC %time %alloc
takeTill Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:354:1-48 74.5 76.6
peekChar Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:348:1-62 5.2 4.2
cmdArg Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(222,1)-(232,38) 4.9 4.7
dolMath Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(304,1)-(312,7) 2.2 2.7
envBody.endenv Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:189:9-75 2.0 2.0
latexBlockParser Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(116,1)-(123,5) 1.7 1.0
isSpecial Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:335:1-29 1.5 0.0
cmdArgs Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(217,1)-(219,30) 1.4 1.4
envName Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(180,1)-(185,21) 1.2 1.5
text2 Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(144,1)-(147,35) 0.8 1.0
env Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(163,1)-(177,22) 0.8 1.1
comment Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(323,1)-(328,23) 0.7 1.1
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
parseLaTeX Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:96:1-45 2793 0 0.0 0.0 100.0 100.0
parseLaTeXWith Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(99,1)-(101,63) 2794 1 0.0 0.0 100.0 100.0
latexParser Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:112:1-57 2796 0 0.0 0.0 100.0 100.0
latexBlockParser Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(116,1)-(123,5) 2799 0 1.6 1.0 100.0 100.0
text Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:(131,1)-(138,71) 2801 0 0.2 0.2 76.5 75.5
peekChar Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:348:1-62 2803 0 3.4 2.5 3.4 2.5
takeTill Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:354:1-48 2808 0 67.3 66.8 72.9 72.8
verbatimEnvironments Text.LaTeX.Base.Parser Text/LaTeX/Base/Parser.hs:72:5-24 3124 14881 0.0 0.0 0.0 0.0
Actually, I think this might be mostly my fault. Here's one of the environments I was trying to expand:
\begin{code}
main :: IO ()
main = do
bool <- read
withSomeSBool (toSBool bool) $ !\annotate{1}!
\(_ :: SBool b) -> !\annotate{2}!
runLogging @b program !\annotate{3}!
\end{code}
which is a \VerbatimEnvironment in latex. I hadn't added it to the list of verbatim environments in hatex---which meant a lot of backtracking was going into this, even though it could never succeed.
What is the result if you use the following parser configuration?
ParserConf { verbatimEnvironments = ["verbatim", "code"] }
The problem might be with the \(_ :: SBool b) bit. The parser might think \( is the beginning of a mathematical expression.
It works as expected with the given ParserConf.
Great! I'm glad that worked for you.
Is the example code above enough to break the parser using default configuration?
The example code above does break the parser as expected. However, in long documents, this broken parser seems to result in the backtracking issue described above---it will keep consuming input outside of the code environment trying to fix the parse.