sile icon indicating copy to clipboard operation
sile copied to clipboard

Paragraph indentation is lost after commands

Open pkulchenko opened this issue 9 years ago • 59 comments

This can be seen on this document:

\begin[class=book,papersize=129mm x 198mm]{document}
\script[src=packages/rules]
I have seldom heard him mention her under any other name.

\hrule[width=100mm,height=0.25mm]

More text here
\end{document}

More text here continues on the same line as the horizontal rule, but should start on a new line.

I've bisected it to this commit 269963ad, more specifically to lineEndLineStartSpace in command =((P("\\")-P("\\begin")) * Cg(myID, "tag") * Cg(parameters,"attr") * V"bracketed_stuff"^0 * lineEndLineStartSpace)-P("\\end{").

pkulchenko avatar Jun 20 '15 00:06 pkulchenko

Urgh. Working out which spaces should and shouldn't be ignored after commands seems like it is beyond me. :-(

A\relax␠B

should render as A B. But

A\relax B

should render as AB, and (presumably) so should

A\relax B

and possibly even

A\relax␠ (additional space after \relax) B

But

A\relax B

should render as

A B

(I think.) If this is correct, and someone could put the rule into words, we can write tests for it. At the moment I'm just too confused. :-(

simoncozens avatar Jun 20 '15 01:06 simoncozens

I think the logic should be absolutely simple: \command\s*EOL should be removed. This means that for the purpose of formatting, if the line ends with command and some optional spaces, the command is removed (along with the spaces and the end-of-line) as if it never existed in the text.

I think it should handle nicely all the test cases you have in your example, but I tried to change lineEndLineStartSpace to (lpeg.S(" ")^0 * lpeg.S("\r\n"))^-1, which expresses what I'm looking for, but it still doesn't work. Any idea why?

pkulchenko avatar Jun 20 '15 02:06 pkulchenko

I think I got it; this is the diff against the current master:

diff --git a/core/inputs-texlike.lua b/core/inputs-texlike.lua
index 4805208..fe7eb11 100644
--- a/core/inputs-texlike.lua
+++ b/core/inputs-texlike.lua
@@ -14,7 +14,7 @@ SILE.inputs.TeXlike.parser = function (_ENV)
   local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
   local parameters = (P("[") * list * P("]")) ^-1 / function (a) return type(a)=="table" and a or {} end
   local anything = C( (1-lpeg.S("\\{}%\r\n")) ^1)
-  local lineEndLineStartSpace = (lpeg.S(" ")^0 * lpeg.S("\r\n")^1 * lpeg.S(" ")^0)^-1
+  local lineEndLineStartSpace = (lpeg.S(" ")^0 * lpeg.S("\r\n"))^-1

   START "document";
   document = V("stuff") * (-1 + E("Unexpected character at end of input"))
@@ -22,10 +22,10 @@ SILE.inputs.TeXlike.parser = function (_ENV)
   stuff = Cg(V"environment" +
     ((P("%") * (1-lpeg.S("\r\n"))^0 * lpeg.S("\r\n")^-1) /function () return "" end) -- Don't bother telling me about comments
     + V("text") + V"bracketed_stuff" + V"command")^0
-  bracketed_stuff = P"{" * V"stuff" * (P"}" + E("} expected"))
-  command =((P("\\")-P("\\begin")) * Cg(myID, "tag") * Cg(parameters,"attr") * V"bracketed_stuff"^0 * lineEndLineStartSpace)-P("\\end{")
+  bracketed_stuff = P"{" * V"stuff" * (P"}" + E("} expected")) * lineEndLineStartSpace
+  command =((P("\\")-P("\\begin")) * Cg(myID, "tag") * Cg(parameters,"attr") * V"bracketed_stuff"^0)-P("\\end{")
   environment =
-    P("\\begin") * Cg(parameters, "attr") * P("{") * Cg(myID, "tag") * P("}") * lineEndLineStartSpace
+    P("\\begin") * Cg(parameters, "attr") * P("{") * Cg(myID, "tag") * P("}")
       * V("stuff")
     * (P("\\end{") * (
       Cmt(myID * Cb("tag"), function(s,i,a,b) return a==b end) +

pkulchenko avatar Jun 20 '15 02:06 pkulchenko

It isn't just paragraph breaks that are a problem here. There is a problem with trailing commands on all lines:

two
words

…typesets as "two words" whether or not you include a trailing space on the first line. On the other hand:

\em{two}
words

…typesets crammed together as "_two_words", again whether or not you manually include any spaces after the command.

See one of the Greek reader page samples from @jtauber for an example of this being a problem.

alerque avatar Jun 20 '15 07:06 alerque

Good point. @simoncozens, is it possible to identify those commands that produce any output in the document? Maybe, as a post-processing step, drop empty lines that had commands on them?

The logic would have only 3 steps:

  1. Drop any trailing whitespaces after a command (but not the new line)
  2. Capture the fact that the line has a command
  3. Delete any empty line if it had a command

It can be simplified further by removing not just empty lines, but also whitespace only lines that include commands, but I'm not sure about all implications. One good aspect is that it would make _\command, \command, and \command_ work the same, but I'm not sure if there are any cases where the difference may be significant.

pkulchenko avatar Jun 20 '15 17:06 pkulchenko

@alerque, I'm still thinking about this and it may be the case that \em{two}\nwords behavior is fine. It's difficult for me to say, but the patch that I shown earlier handles the cases that Simon listed exactly as he wanted to.

There are several easy workarounds: (1) add a space before words, (2) put those on one line, or (3) add a comment after \em: \em{two} %. Each of those should render two and words separately.

I'd prefer to handle this case "automatically", but it may not be possible without deeper analysis.

@simoncozens, thoughts?

pkulchenko avatar Jun 30 '15 22:06 pkulchenko

Right now my thoughts are I would like to rip out the changes I made and return to the old, you-need-to-add-percent-signs-everywhere parser. It was less user-friendly but at least it was consistent and unsurprising...

simoncozens avatar Jul 01 '15 06:07 simoncozens

:cry: Having more %s floating around than even LaTeX needed was not a happy state of affairs.

alerque avatar Jul 01 '15 09:07 alerque

It was less user-friendly but at least it was consistent and unsurprising...

I wouldn't call it unsurprising as I was quickly bitten by text\n\command\nmore text not being in the same paragraph (and I'm quite familiar with LaTeX). I have a habit of putting commands (especially with a list of parameters on a separate line), which breaks paragraphs with the current logic.

Having more %s floating around than even LaTeX needed was not a happy state of affairs.

I'm with Caleb on this one. This is one of the things I'd prefer to see less in my text/code.

@simoncozens, I'm proposing a very simple rule that should (1) handle the examples you listed, (2) handle those examples that Caleb mentioned, and (3) be simple to implement:

When a line starts with a command without any leading spaces, treat the previous new line as a space unless it's an empty line (a line without non-whitespace characters). That's it.

This means that text \command and text\n\command will be rendered the same, which is what the user would want in most (all?) cases. The "unless it's an empty line" clause is needed to avoid merging the line into the first paragraph in the following:

Some text

\hrule[width=100mm,height=0.25mm]

More text

pkulchenko avatar Jul 01 '15 16:07 pkulchenko

I've spent half a day today trying to figure out what exactly the problem is and why the current solution is not working.

After trying the various examples and fixes, I came full circle. I agree with @simoncozens that the earlier functionality needs to be restored (revert 269963ad and 069205f2).

If the functionality is restored, there are several "features" that (ideally) needs to be fixed:

  • several commands before a paragraph change indentation of that paragraph. This is the issue described in #54.
  • several commands with an empty line produce a standalone (empty) paragraph.
  • several commands at the top of the document with an empty line produce a standalone (empty) paragraph (this and the previous item are the same issue, just caused by different scenarios).

Examples:

\begin[class=book,papersize=129mm x 198mm]{document}
\script[src=packages/color]
\script[src=packages/rules]%\hrule[width=10mm,height=0.25mm]
\script[src=packages/image]

First word, first paragraph.

\raggedright
\raggedright
Last paragraph
\end{document}

I've tried various fixes, but I don't think any of them are going to work simply because there are commands that need to ignore spaces and other commands that don't need to and the parser doesn't have that information. For example, in the example above \hrule may need to be rendered and if there are two \hrule commands, then the whitespaces between them become significant.

Even the rendering of that \hrule is not without flaws as the previous command includes a whitespace, so the rule is rendered with a small indent. Adding % to the second line (with \script[src=packages/color]) fixes the issue.

If it's not possible to come up with a rule that is going to work automatically, I think we may need to introduce a new command that will mark a set of commands and "eat" all whitespaces. This is still better than % as it's going to be applied to all inserted commands and everything else in between.

I'm proposing \cmd and \cmd{}. The former will ignore all whitespaces from this command to the end of the paragraph or the first non-empty text (whichever comes first). The latter will ignore whitespaces inside its body.

I've tried to implement this command based on font command, but I couldn't figure out how to walk content to remove/ignore whitespaces inside:

    SILE.registerCommand("cmd", function(options, content)
      if (type(content)=="function" or content[1]) then
        SILE.settings.pushState()
        SILE.process(content)
        SILE.settings.popState()
      end
    end, "Processes commands while ignoring whitespaces")

Any help?

pkulchenko avatar Jul 02 '15 05:07 pkulchenko

there are commands that need to ignore spaces and other commands that don't need to and the parser doesn't have that information

I've been playing with scenarios myself and arrived at much the same conclusion you did. The parser needs to be fed more information. It's a crazy hard problem, but however it is solved I want a syntax I can actually teach to non-programmers and have them understand what the deal is. Trailing comment operators to solve a white space parsing issue seems like it will be a never ending debacle to explain.

The \cmd{} (or perhaps a more descriptive name like\silent{} or \nooutput{}) makes a bit of sense and I agree is an improvement over trailing %. That would also allow for block syntax to be used, ala <head> it HTML or everything before \begin{document} in LaTeX without worrying about throwing off the document layout with some extra spaces before the content.

\begin{cmdonly}
    \stuff
    \here
\end{cmdonly} 

(Still trying on command names for size there.)

Another possibility that occurred to be is simply an alternate syntax for all commands. If the idea is that the parser doesn't have enough information, some structural part of the syntax could act as a flag for whether or not any given command should or should not (and perhaps how) eat whitespace.

alerque avatar Jul 02 '15 05:07 alerque

The \cmd{} (or perhaps a more descriptive name like\silent{} or \nooutput{}) makes a bit of sense and I agree is an improvement over trailing %. That would also allow for block syntax to be used, ala

it HTML or everything before \begin{document} in LaTeX without worrying about throwing off the document layout with some extra spaces before the content.

Right; I also played with \quiet, \preamble and few others; I prefer \cmd because (1) it's similar to \script and other SILE tags (it's a container for commands) and (2) it's short.

I didn't like \quiet, \silent, or \nooutput, because this command doesn't suppress output from commands; it only removes whitespaces between commands.

\trim is another option as it trims the result ("to make (something) neat by cutting it"), but \cmd may be more similar to the philosophy behind SILE tags.

Another possibility that occurred to be is simply an alternate syntax for all commands. If the idea is that the parser doesn't have enough information, some structural part of the syntax could act as a flag for whether or not any given command should or should not (and perhaps how) eat whitespace.

I prefer a specific tag, simply because it limits the number of commands and because it eats whitespaces between commands, which would still be difficult to explain by having alternative syntax.

I got it working as a container \cmd{}, but I also want to get it working as \cmd (to the beginning of the text), similar to how \font works and can't figure out how to do it properly yet.

pkulchenko avatar Jul 02 '15 15:07 pkulchenko

@simoncozens, @alerque, I've implemented \cmd command in both \cmd and \cmd{} variants; pushed to https://github.com/pkulchenko/sile/tree/command-friendly-whitespace. This branch also includes a revert of the two WIP commits in SILE.

Simon, I simply hacked massage_ast as I couldn't make it work with command handlers for \cmd (as it needs to be applied for subsequent elements that are not included in content passed to the command handler). I'm sure you have some mechanism for traversing the tree, but I couldn't find it, so for ended up using the AST for testing purposes.

It achieves all the effects I wanted to; the following example works as I expect it:

\begin[class=book,papersize=129mm x 198mm]{document}\cmd
\script[src=packages/color]
\script[src=packages/rules]%\hrule[width=10mm,height=0.25mm]
\script[src=packages/image]

First word, first paragraph.

\cmd{
\raggedright
\raggedright}

Last paragraph
\end{document}

The first \cmd (on the first line) eliminates incorrect vertical spacing and indentation for the first paragraph. The second \cmd{} eliminates vertical spacing between paragraphs.

If you want to see the "original/current" behavior, simply comment out line 97 in inputs-texlike.lua.

I played with making it enabled by default, but don't think it's a good idea.

pkulchenko avatar Jul 02 '15 18:07 pkulchenko

Simon, what do you think about the \cmd proposal?

pkulchenko avatar Jul 21 '15 05:07 pkulchenko

@simoncozens, @alerque, any further thoughts on this? The current state is sort of the middle of not-quite-tex-like and not-quite-working-in-some-cases. The \cmd proposal would revert this to the original behavior (which is expected result for those familiar with TeX), but provide a way to achieve proper formatting around commands without % galore.

pkulchenko avatar Aug 05 '15 17:08 pkulchenko

Hi Paul, sorry about the delay; I've been prioritizing a lot of the internationalisation work and stuff with discretionaries - now that's sorted and the tests pass again I'd like to come back to this. I think the \cmd proposal is the way forward - revert the misguided changes to the parser and then introduce a new process function which doesn't pass stuff to the typesetter at top level.

simoncozens avatar Aug 05 '15 17:08 simoncozens

This sounds good. You can check my attempt to do this (https://github.com/pkulchenko/sile/tree/command-friendly-whitespace) as it includes the revert of the changes (one commit), plus the new command (another commit); I'm sure you can find a proper way to implement the needed logic ;). I've tried to make it work with both \cmd and \cmd{} and would like to have both variants available; the former is a shortcut that works for one paragraph (as far as I remember).

pkulchenko avatar Aug 05 '15 18:08 pkulchenko

I've come up with another way to do this which I think works OK. Basically we revert the two erroneous changes and introduce a new rule: text which consists entirely of newlines is dropped. Please have a look at the latest commit, especially the new test tests/space-after-command.sil. If this works the way everyone expects, I will fix up the regression test expectations.

simoncozens avatar Aug 07 '15 10:08 simoncozens

Hmmmm, that's very interesting. So far all the random tests I through at it did the thing I expected.

alerque avatar Aug 07 '15 10:08 alerque

Seeing as how you've jinxed me twice today I'm asking: @simoncozens are you working on the tests or would you like me to fix them up? I have a previous test related commit to send and can send it with or without updated regression tests for this issue.

alerque avatar Aug 07 '15 11:08 alerque

I wasn't going to fix up the tests for this yet - was looking for a third opinion from @pkulchenko before declaring this done.

simoncozens avatar Aug 07 '15 11:08 simoncozens

Alright well I have a fixed up set (all but one that I'm not sure about) that I'll post in a minute but you can wait to merge it until we decide on this. If it turn out we do something different I'll split out the non-whitespace related changes from the set.

alerque avatar Aug 07 '15 11:08 alerque

One case where this handling is broken (relative to previous behavior) is in the \pagetemplate command. Indenting such as:

\pagetemplate[…]{
    \frame[…]
}

…leaves blank space at the top of a document where it didn't used to while

\pagetemplate[…]{
\frame[…]
}

…works as expected.

alerque avatar Aug 07 '15 11:08 alerque

I toyed with the idea of never typesetting anything which is purely whitespace, but then you have the (very useful) \book:section:post and similar macros which are defined as being a space. \book:sectioning\book:section:post{}Section name should insert a space. So that wouldn't work.

I think the way around this is to give \pagetemplate and other purely structural commands their own version of SILE.process that never sends any text to the typesetter.

simoncozens avatar Aug 07 '15 11:08 simoncozens

@simoncozens, this is an interesting change. I should be able to test it on my examples shortly, but one clarification first; you said "we revert the two erroneous changes", but I only see one change to remove * _. Is it as intended or do you plan more changes?

pkulchenko avatar Aug 07 '15 13:08 pkulchenko

See also 247ad19.

simoncozens avatar Aug 07 '15 13:08 simoncozens

@pkulchenko I interpreted that as a reference to the fact that what got reverted was originally spread out over two commits.

Also if you have any example cases that are affected by this behavior I think we should flesh out the test to cover all the cases where this could make a difference to the output.

alerque avatar Aug 07 '15 13:08 alerque

@simoncozens, right, missed the first commit.

@alerque, absolutely; I'll run on few examples I have and post them here if anything interesting.

pkulchenko avatar Aug 07 '15 13:08 pkulchenko

@simoncozens, I tried to test, but I'm getting core/inputs-common.lua:13: attempt to index field 'masterFilename' (a nil value) on the latest master. This doesn't seem to be related to the changes we've been discussing; I bisected it to this @alerque's commit: 1688beadec49db3f043b915ff62c. Is it just me or do you see the same error? I'm simply running lua sile examples\simple.sil.

pkulchenko avatar Aug 07 '15 14:08 pkulchenko

Did you try rerunning ./configure to get a new sile executable? That bit me.

simoncozens avatar Aug 07 '15 14:08 simoncozens