pandoc icon indicating copy to clipboard operation
pandoc copied to clipboard

Fix #4546 (carriage return in HTML)

Open link2xt opened this issue 7 years ago • 17 comments

link2xt avatar Apr 11 '18 09:04 link2xt

As I've noted on the linked issue, this will very likely have unexpected effects in various places, since all of the parsers have been developed under the assumption that the documents don't contain CRs. Changing the HTML reader is only a start to what would be required.

Hence I think it's too risky to merge this.

jgm avatar Apr 11 '18 17:04 jgm

@jgm

As I've noted on the linked issue, this will very likely have unexpected effects in various places, since all of the parsers have been developed under the assumption that the documents don't contain CRs.

All readers except CommonMark, Docx, EPUB, Native and Odt call crFilter by themselves immediately before processing the text. I mentioned it in the first commit message. Removing crFilter functionality from Text<->ByteString conversion functions just moves control over this issue to readers, so they can decide whether they filter out CRs or not.

No tests are broken by the way. If something breaks later, it is only a matter of adding crFilter to the affected reader, but looks like all the readers left don't need it. CommonMark uses external library, Docx and Odt are XML formats, EPUB is a container for HTML, and Native, as a serialization format, is better off without any kind of additional filtering.

link2xt avatar Apr 11 '18 20:04 link2xt

Ah, I'd forgotten that we'd moved crFilter inside the individual readers.

The other things to check would be other consumers of the UTF8 functions... For example, does the template engine assume that CRs have been stripped from templates? Will there be changes if you do a header-include?

Alexander [email protected] writes:

@jgm

As I've noted on the linked issue, this will very likely have unexpected effects in various places, since all of the parsers have been developed under the assumption that the documents don't contain CRs.

All readers except CommonMark, Docx, EPUB, Native and Odt call crFilter by themselves immediately before processing the text. I mentioned it in the first commit message. Removing crFilter functionality from Text<->ByteString conversion functions just moves control over this issue to readers, so they can decide whether they filter out CRs or not.

No tests are broken by the way. If something breaks later, it is only a matter of adding crFilter to the affected reader, but looks like all the readers left don't need it. CommonMark uses external library, Docx and Odt are XML formats, EPUB is a container for HTML and Native, as a serialization format, is better off without any kind of additional filtering.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/jgm/pandoc/pull/4548#issuecomment-380589168

jgm avatar Apr 11 '18 23:04 jgm

@jgm

For example, does the template engine assume that CRs have been stripped from templates? Will there be changes if you do a header-include?

Just checked, templating engine does not seem to assume anything about CRs. Header-includes now pass CRs through. Before this PR CRs were filtered out from includes.

Isn't the new behaviour more correct? Why should CRs be removed from includes, as opposed to including them as-is?

link2xt avatar Apr 12 '18 13:04 link2xt

Alexander [email protected] writes:

@jgm

For example, does the template engine assume that CRs have been stripped from templates? Will there be changes if you do a header-include?

Just checked, templating engine does not seem to assume anything about CRs. Header-includes now pass CRs through. Before this PR CRs were filtered out from includes.

Isn't the new behaviour more correct? Why should CRs be removed from includes, as opposed to including them as-is?

I can't think of a reason why they should be stripped. I'm just trying to think of all the places this change might matter, and which we may not have thought of...

jgm avatar Apr 12 '18 15:04 jgm

Grep for toText and toString (which calls toText) does not return that many results, none of them looks like a place to insert crFilter for me.

link2xt avatar Apr 12 '18 16:04 link2xt

See the docs for System.IO.NewlineMode:

"Specifies the translation, if any, of newline characters between internal Strings and the external file or stream. Haskell Strings are assumed to represent newlines with the '\n' character; the newline mode specifies how to translate '\n' on output, and what to translate into '\n' on input."

So, I think part of the motivation for stripping CRs was to ensure that our strings conform to this invariant. (The newlines then get translated to CRLF on appropriate platforms, and a similar conversion is done on input -- see putStrLnWith in UTF8 module.)

Note that System.IO only supports CR and CRLF newlines.

So I'm not completely confident about the possible side effects of this change.

jgm avatar Apr 14 '18 16:04 jgm

Another option would be to change crFilter to something like

crFilter [] = []
crFilter ('\r':'\n':xs) = '\n':crFilter xs
crFilter ('\r':xs) = '\n':crFilter xs
crFilter (x:xs) = x:crFilter xs

One would want to benchmark this to make sure it's not much slower than the current filter.

jgm avatar Apr 14 '18 16:04 jgm

@jgm

See the docs for System.IO.NewlineMode ... So, I think part of the motivation for stripping CRs was to ensure that our strings conform to this invariant.

Text files are read via Text.Pandoc.UTF8.readFile. It calls System.IO.openFile, which opens a file in text mode, in which CRLF is translated to LF on Windows. The reason for stripping CRs manually is to handle the case when a file with CRLFs is read on *nix platform.

Another option would be to change crFilter to something like

Instead of making HTML reader parse CR as a newline (according to HTML spec) you propose to replace single CRs with LF for all text formats, so now they will all treat single CR as a newline, when they should not. And "\r\r\n" will be translated to "\r\n" on *nix and to "\n" on Windows.

Proper solution would be to read all files in binary mode (by replacing openFile with openBinaryFile) and replace CRLFs manually with

crFilter [] = []
crFilter ('\r':'\n':xs) = '\n':crFilter xs
crFilter (x:xs) = x:crFilter xs

Then, we will treat "\r\r\n" properly as a carriage return followed by newline and single CRs will survive conversion from text markup formats. If someone needs to preserve CRs, this change can be applied on top of this PR. It is not related to the problem: single CRs not being handled as a newline in HTML reader.

link2xt avatar Apr 14 '18 19:04 link2xt

Instead of making HTML reader parse CR as a newline (according to HTML spec) you propose to replace single CRs with LF for all text formats, so now they will all treat single CR as a newline, when they should not. And "\r\r\n" will be translated to "\r\n" on *nix and to "\n" on Windows.

No, the idea was that it would be translated to \n\n in both cases. That's what my revised crFilter function would do (even if the file was opened in text mode and the CRLF converted to LF).

You seem confident that it should be treated as \r\n, but why? I'd need to know more about where you'd actually find such a string in the wild, and why. I'm familiar with older OSs that use CR as line endings, but in what case would you find CRCRLF?

It is not related to the problem: single CRs not being handled as a newline in HTML reader.

Your PR contains a change to Text.Pandoc.UTF8, which potentially affects many things. It's not just a change to the HTML reader. So I don't want to go forward until the details have been understood and sorted out.

jgm avatar Apr 15 '18 00:04 jgm

@jgm

Note that System.IO only supports CR and CRLF newlines.

I assume it is a typo and read it as "LF and CRLF".

No, the idea was that it would be translated to \n\n in both cases. That's what my revised crFilter function would do (even if the file was opened in text mode and the CRLF converted to LF).

If CRLF is converted to LF (on Windows), "\r\r\n" will be converted to "\r\n" during reading, and crFilter will convert it to "\n". If CRLF not converted to LF, crFilter will convert "\r\r\n" to "\n\n".

You seem confident that it should be treated as \r\n, but why?

If we agree that only LF (Haskell and POSIX standard) and CRLF (for Windows compatibility) line endings are valid, then CRCRLF can be read either as CR<>newline or CRCR<>newline.

I'd need to know more about where you'd actually find such a string in the wild, and why. I'm familiar with older OSs that use CR as line endings, but in what case would you find CRCRLF?

CRCRLF is just a corner case I invented, in which case applying proposed crFilter will result in single CR being treated as a newline.

I don't care about CRCRLF case that much, but it worked correctly before my PR and works the same after my PR (unless we decide that we need to support CR newlines in all text formats).

Your PR contains a change to Text.Pandoc.UTF8, which potentially affects many things. It's not just a change to the HTML reader. So I don't want to go forward until the details have been understood and sorted out.

Ok, let me summarize.

System.IO has built-in "newline normalization", which translates CRLF in files to LF in process memory on Windows.

crFilter just removes CRs, it is its only function.

I assume that LF and CRLF are the only valid newlines. It is just HTML that treats single CRs differently, according to its standard.

Before PR it works like this:

  1. Text files are read with UTF8.readFile, which applies newline normalization, removes BOM and applies crFilter.
  2. All Text pandoc readers apply crFilter again for themselves, which is no-op because CRs are already removed.

After PR:

  1. Text files are read with UTF8.readFile, which applies newline normalization and removes BOM.
  2. All Text pandoc readers, except HTML reader, apply crFilter, so their behavior does not change.
  3. HTML reader does not apply crFilter and does its own processing of CRs according to HTML standard.

The changes are minimal because crFilters are built in every pandoc reader, their behavior does not change because of removing of crFilter from UTF8.readFile.

If we want not a minimal change, but more correct one, I propose that we change UTF8.readFile to read files in binary mode, remove BOM and replace all CRLFs with LF regardless of platform pandoc runs on. Then remove crFilter everywhere.

link2xt avatar Apr 15 '18 08:04 link2xt

@jgm Just checked CommonMark spec, it treats CR, LF and CRLF as line ending: http://spec.commonmark.org/0.28/#line-ending

It is one more reason to move crFilter out of readFile. Currently file with "foo\rbar" is read by CommonMark reader as "foobar". With my PR, it is read as [Para [Str "foo",SoftBreak,Str "bar"]].

link2xt avatar Apr 15 '18 10:04 link2xt

If we want not a minimal change, but more correct one, I propose that we change UTF8.readFile to read files in binary mode, remove BOM and replace all CRLFs with LF regardless of platform pandoc runs on. Then remove crFilter everywhere.

Or, even better: make UTF8.readFile read files in binary mode and remove BOM, then treat line endings individually in each reader. CommonMark and XML readers can deal with all kinds of line endings by themselves, other readers will apply some filter that replaces all line endings with LFs and can transition to more complicated implementation later on if needed.

In any case, I would like this minimal change to be applied as-is if no test case is found where it breaks something, just to fix the bug (both HTML and CommonMark now). Additional correctness fixes can be applied on top of it after that.

link2xt avatar Apr 15 '18 10:04 link2xt

Alexander [email protected] writes:

Before PR it works like this:

  1. Text files are read with UTF8.readFile, which applies newline normalization, removes BOM and applies crFilter.
  2. All Text pandoc readers apply crFilter again for themselves, which is no-op because CRs are already removed.

It's not a no-op! If you use pandoc as a library, you may not be getting your input from UTF8.readFile or UTF8.getContents. That's why crFilter needs to be there in the readers.

UTF8.readFile does filter CRs, it's true, because it uses toString, which uses toText, which filters CRs. You might think that because we use crFilter in the readers, we don't need to filter them here. But that's not so obvious to me. For example, consider the use of readFile to read a LaTeX include file, which is then added to the stream of input to be parsed. Here crFilter isn't called; we rely on CRs being filtered by readFile. There may be other cases like this, too. I think a very careful audit of all uses of the UTF8 module's functions is needed before we remove CR stripping there.

If we want not a minimal change, but more correct one, I propose that we change UTF8.readFile to read files in binary mode, remove BOM and replace all CRLFs with LF regardless of platform pandoc runs on. Then remove crFilter everywhere.

That doesn't work, because of the point above -- you can use the readers and pass them your own input, without ever using UTF8.readFile. This is what people typically do when they use pandoc in a web app, for example.

jgm avatar Apr 15 '18 15:04 jgm

@jgm

It's not a no-op! If you use pandoc as a library, you may not be getting your input from UTF8.readFile or UTF8.getContents. That's why crFilter needs to be there in the readers.

Hmm, I haven't thought of readers as external interface, thanks. Then readers should expect to receive anything, ok.

For example, consider the use of readFile to read a LaTeX include file, which is then added to the stream of input to be parsed. Here crFilter isn't called; we rely on CRs being filtered by readFile. There may be other cases like this, too. I think a very careful audit of all uses of the UTF8 module's functions is needed before we remove CR stripping there.

Ok, postpone this PR until I will add CR filtering to LaTeX.hs (there are two calls to readFileFromDirs there) and test it.

link2xt avatar Apr 15 '18 16:04 link2xt

The use in the LaTeX reader was just one example. There are other formats that use include files, too. Of course, now that I think of it, readFile may already be doing line-ending conversion, if that's the default for System.IO.hGetContents.

jgm avatar Apr 15 '18 18:04 jgm

Applied to \include but not for \lstinputlisting for now.

link2xt avatar Apr 16 '18 14:04 link2xt