Broken links between multiple input Markdown files

Open bebuch opened this issue 5 years ago • 8 comments

We convert our documentation from Markdown to PDF using Pandoc. Usually several Markdown files are converted to one PDF file.

Links between Markdown files that are included in the same PDF are broken in the PDF.

A very similar bug report already existed in 2016 with #2719.

Minimal example

File test-1.md:

# Headline 1

some text

## Headline 2

more text

File test-2.md:

# Headline 1

some other text

## Another headline

more other text

[link to #headline-1](#headline-1) **wrong link**
refers to Headline 1 in test-1, should refer to Headline 1 in test-2

[link to #another-headline](#another-headline) **works**
as expected because anchor is unique

[link to test-1.md](test-1.md) **broken link**

[link to test-1.md#headline-1](test-1.md#headline-1) **broken link**

[link to test-1.md#headline-2](test-1.md#headline-2) **broken link**

[link to test-2.md](test-2.md) **broken link**

[link to test-2.md#headline-1](test-2.md#headline-1) **broken link**

[link to test-2.md#another-headline](test-2.md#another-headline) **broken link**

Compile it to PDF:

docker run --rm --volume "$(pwd):/data" --user $(id -u):$(id -g) pandoc/latex:2.9.2.1 test-1.md test-2.md -o test.pdf

You get the same behavior with HTML which is simpler to debug:

docker run --rm --volume "$(pwd):/data" --user $(id -u):$(id -g) pandoc/latex:2.9.2.1 test-1.md test-2.md -o test.html

Here is the HTML output:

<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headline 1 in test-1, should refer to Headline 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> <strong>works</strong> as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>

Expected behavior

  1. Links to anchors without a file name should resolve, in the converted file, to the matching headline from the Markdown file that contains the link, not to the first identically named headline across all input files.
  2. Links to a file plus anchor, where the file is one of the Markdown files passed to Pandoc, should point to the corresponding anchor in the converted file, not to the original file itself.
  3. Links to a file without an anchor, where the file is one of the Markdown files passed to Pandoc, are a bit harder to resolve:
    1. If the referenced Markdown file starts with a headline (of any level), the link should point to that headline.
    2. Otherwise, an additional anchor must be inserted at that point in the target document, and the link should point to it. (Alternatively, such links could simply be removed, since these cases are likely to be very rare. Inserting an anchor is preferable if it can be implemented, because very rare is not never.)

bebuch avatar May 20 '20 11:05 bebuch

I think you have a misconception about how pandoc treats its input files.

If you include multiple files on the command line, pandoc concatenates their contents and parses the result, paying no attention to what file a particular bit of markdown is found in. (In this respect it works like a lot of other unix tools.)

So the fact that the link is found in the second file makes no difference.
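
To make the consequence concrete, here is a rough Python model of pandoc's documented auto-identifier scheme (lowercase, spaces to hyphens, `-1`, `-2`, ... appended to repeats). It is a simplified sketch, not pandoc's actual code; `slugify` and `assign_ids` are names invented for this illustration.

```python
import re


def slugify(heading):
    """Simplified model of an automatic heading identifier."""
    # Keep letters, digits, spaces, underscores, hyphens and periods only.
    kept = "".join(ch for ch in heading if ch.isalnum() or ch in " _-.")
    slug = re.sub(r"\s+", "-", kept.strip()).lower()
    # Drop everything before the first letter.
    while slug and not slug[0].isalpha():
        slug = slug[1:]
    return slug or "section"


def assign_ids(headings):
    """Assign unique ids across ONE concatenated stream of headings."""
    used = set()
    ids = []
    for heading in headings:
        base = slugify(heading)
        candidate, n = base, 0
        while candidate in used:
            n += 1
            candidate = f"{base}-{n}"
        used.add(candidate)
        ids.append(candidate)
    return ids


# Headings of test-1.md followed by those of test-2.md, as one input.
print(assign_ids(["Headline 1", "Headline 2", "Headline 1", "Another headline"]))
# ['headline-1', 'headline-2', 'headline-1-1', 'another-headline']
```

Because the numbering runs over the concatenated input, the second "Headline 1" becomes `headline-1-1`, and a link to `#headline-1` written in the second file lands on the heading from the first file.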

You might try experimenting with the --file-scope option, depending on your needs. (Note that this has certain limitations, though: e.g. with that option you can't define a link reference in one file and use it in another.)

jgm avatar May 20 '20 14:05 jgm

I understand that this is how it works currently, but this is not useful behavior.

Thanks for your comments, that helped me to better understand the current behavior. The current behavior is in accordance with the documentation, so this is not a bug report, but an enhancement request.


The point is that PDF is a file format that always consists of one file. All contents are embedded in it.

In Markdown (and also HTML for example), however, the same information is handled in multiple files within a directory structure. This is absolutely necessary, because images, for example, cannot be embedded in these formats.

A document converter should be able to convert such multi-file document formats into single-file formats without breaking the 'in document' links and image inclusions.

If I understand it correctly, the conversion process is currently divided into a read and a write process. To implement the proposed behavior, the reading process for markdown files would have to be adapted.

  1. Load input files individually
  2. Parse input files individually
  3. Adjust relative links between all included files (including images)
  4. Merge ASTs

Technically, I see no reason why this could not be implemented. It would make the tool much more useful and easier to use. For example, in many cases, the use of the parameter --resource-path would become obsolete, since the inclusion with the intelligent behavior simply works.
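
For the image part of step 3, a minimal sketch in Python of what "adjusting" a relative path could mean: resolve it against the directory of the file it was written in, so it is valid from the working directory. `rebase_resource` and the example paths are hypothetical and not part of Pandoc.

```python
import posixpath
from urllib.parse import urlsplit


def rebase_resource(source_file, target):
    """Rewrite a relative resource path so it is valid from the working directory."""
    parts = urlsplit(target)
    if parts.scheme or parts.netloc or target.startswith(("/", "#")):
        # Absolute URL, absolute path or pure fragment: leave untouched.
        return target
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, target))


print(rebase_resource("docs/guide/intro.md", "img/logo.png"))
# docs/guide/img/logo.png
print(rebase_resource("docs/guide/intro.md", "../shared/fig.png"))
# docs/shared/fig.png
```

With paths rewritten this way at read time, each file's own location does the job that --resource-path otherwise has to do.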

I understand this is a major change, but I think it would save a lot of people a lot of time and energy. I would therefore be very pleased about a second review.

If you agree, please reopen the issue.

bebuch avatar May 20 '20 17:05 bebuch

--file-scope does 1, 2, and 4 -- did you try it?

The one thing it won't do is rewrite heading IDs or links, so in your example the two identical headings would not receive unique anchors. I'd suggest dealing with this by using explicit identifiers (see link attribute syntax) when you have duplicated headings.

jgm avatar May 20 '20 18:05 jgm

If you'd like, you can create a more general issue that requests that the markdown parser be made sensitive to the file containing each particular bit of content. This would require a fairly big architectural change: the readers would have to be changed to take a [(FilePath, Text)] argument instead of just a Text. Currently they simply don't have access to information about the containing file.

jgm avatar May 20 '20 18:05 jgm

Answer to the first comment

Step 3 is the most important one in this process. ;-)

When I compile to HTML with --file-scope, all links described above as broken are still broken. In addition, the HTML is also invalid, because now two elements have the same ID.

<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headline 1 in test-1, should refer to Headline 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> <strong>works</strong> as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>
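
A quick way to check output like this mechanically is a small audit script. The sketch below uses only Python's standard library; `AnchorAudit` and `audit` are names made up here, and the sample string is a shortened stand-in for the HTML above.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Collect every id attribute and every <a href> in a document."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids[value] += 1
            elif name == "href" and tag == "a":
                self.hrefs.append(value)


def audit(html):
    """Return (duplicate ids, links leaving the document, dangling fragments)."""
    parser = AnchorAudit()
    parser.feed(html)
    parser.close()
    duplicates = sorted(i for i, n in parser.ids.items() if n > 1)
    non_local = [h for h in parser.hrefs if not h.startswith("#")]
    dangling = [h for h in parser.hrefs
                if h.startswith("#") and h[1:] not in parser.ids]
    return duplicates, non_local, dangling


sample = (
    '<h1 id="headline-1">Headline 1</h1>'
    '<h1 id="headline-1">Headline 1</h1>'
    '<p><a href="#headline-1">a</a> <a href="test-1.md#headline-2">b</a></p>'
)
print(audit(sample))
# (['headline-1'], ['test-1.md#headline-2'], [])
```

The first list shows the duplicated ID that makes the HTML invalid, the second the links that still point at the original .md files.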

The important point is that if you open the Markdown files with OOO, for example, all the links work before conversion. After conversion to HTML or PDF, the same links are broken. The links are not carried through the conversion, so the conversion is incomplete. (This assumes the Markdown files are treated as one document spread over several files.)

Answer to the second comment

If I interpret your description and the behavior of --file-scope correctly, then step 3 can be done before merging the ASTs.

I haven't checked the source code for this yet, but I would suspect that the merging of the ASTs takes place at a location where the original filenames (including path) are available. If this is the case, the addition of this step would be less complex.

I don't know if I can get around to checking it today, but probably sometime next week. Thanks for the feedback!

bebuch avatar May 20 '20 18:05 bebuch

step 3 can be done before merging the ASTs.

Yes, that's true, when --file-scope is used we have

        mconcat <$> mapM (readSource >=> r readerOpts) sources'

It would be possible, for example, to insert an identifier prefix derived from the file name before each internal identifier. Internal links to rewritten ids could also be rewritten.

I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to other-markdown-file.md#ident, so I guess your idea is that if other-markdown-file.md is one of the files on the command line, this would get rewritten to something like #other-markdown-file-md-ident.

That would mean that, for example, you couldn't link to the document's source file from the document -- perhaps an undesirable consequence for some users.

Paging @jkr who added the file-scope option originally and may have some thoughts on whether it should be changed in this way.

jgm avatar May 20 '20 18:05 jgm

so I guess your idea is that if other-markdown-file.md is one of the files on the command line, this would get rewritten to something like #other-markdown-file-md-ident.

Exactly!

I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to other-markdown-file.md#ident

These links can remain unchanged. The linked data is not part of the converted document, so it is okay if it breaks if the linked data is not copied separately.

Ideally, you could convert them to absolute HTTP paths using an additional Pandoc command line option like --rebase-relative-links-on 'https://github.com/jgm/pandoc/tree/master/doc', but that might need to be addressed in a separate issue afterwards.
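
As a rough illustration of what such an option could do (it does not exist in Pandoc; the option name above is only a proposal, and `rebase_link` and `extra.md` are hypothetical), here is a Python sketch based on `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin, urlsplit


def rebase_link(target, base_url, sources):
    """Turn a relative link to a file outside the document into an absolute URL."""
    if target.startswith("#") or urlsplit(target).scheme:
        # Fragment-only links and absolute URLs are left alone.
        return target
    if target.partition("#")[0] in sources:
        # Part of the converted document: handled by the internal rewrite instead.
        return target
    # urljoin treats the last path segment as a file unless the base ends in "/".
    base = base_url if base_url.endswith("/") else base_url + "/"
    return urljoin(base, target)


base = "https://github.com/jgm/pandoc/tree/master/doc"
print(rebase_link("extra.md#intro", base, ["test-1.md", "test-2.md"]))
# https://github.com/jgm/pandoc/tree/master/doc/extra.md#intro
```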

bebuch avatar May 20 '20 19:05 bebuch

Is this solved now? I agree that multi file handling without this option is not very useful at this time :(

mbrucher avatar Jun 18 '22 11:06 mbrucher

@mbrucher Looks like 6e45607f9948f45b2e94f54b4825b667ca0d5441 does a lot of preparation to solve it, but the actual link fixing still isn't done.

bebuch avatar Sep 06 '22 17:09 bebuch

I think the best fix is not a change in the reader itself, but rather modifying the behavior of --file-scope: Instead of

        mconcat <$> mapM (readSource >=> r readerOpts) sources'

we could have something like

        mconcat <$> mapM (\s -> readSource s >>= r readerOpts >>= rewriteLinksAndIdentifiers s sources') sources'

Here rewriteLinksAndIdentifiers would change all the identifiers in source s by adding a prefix derived from s. It would also change all links of the form FILE(#anchor)? where FILE is in sources' accordingly.

This would produce the behavior you're going for. One drawback, though, is that even explicitly provided anchors would change, and this might cause problems for some people.
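
To show the intended effect on the example from this issue, here is a toy model in Python. It is only a sketch of the idea, not pandoc code: a parsed file is reduced to a list of `('heading', id)` and `('link', target)` pairs, and `file_prefix`, `rewrite_links_and_identifiers` and `convert` are illustrative names. Sending a bare FILE link to the first heading of that file is an assumption taken from the expected behavior described at the top.

```python
import re


def file_prefix(path):
    """'test-1.md' -> 'test-1-md' (one possible prefix scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-").lower()


def rewrite_links_and_identifiers(source, blocks, first_heading):
    """Prefix ids of one file and retarget links into any of the input files."""
    out = []
    for kind, value in blocks:
        if kind == "heading":
            out.append((kind, f"{file_prefix(source)}-{value}"))
            continue
        file_part, _, fragment = value.partition("#")
        target_file = file_part or source  # '#x' means "this file"
        ident = fragment or first_heading.get(target_file)
        if target_file in first_heading and ident:
            out.append((kind, f"#{file_prefix(target_file)}-{ident}"))
        else:
            out.append((kind, value))  # external link, left unchanged
    return out


def convert(parsed):
    """parsed maps each source file to its blocks; returns the merged blocks."""
    first_heading = {
        src: next((v for k, v in blocks if k == "heading"), None)
        for src, blocks in parsed.items()
    }
    merged = []
    for src, blocks in parsed.items():
        merged.extend(rewrite_links_and_identifiers(src, blocks, first_heading))
    return merged


parsed = {
    "test-1.md": [("heading", "headline-1"), ("heading", "headline-2")],
    "test-2.md": [("heading", "headline-1"), ("link", "#headline-1"),
                  ("link", "test-1.md"), ("link", "test-1.md#headline-2")],
}
for block in convert(parsed):
    print(block)
```

In this model the link `#headline-1` in test-2.md ends up as `#test-2-md-headline-1`, and `test-1.md#headline-2` as `#test-1-md-headline-2`.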

jgm avatar Sep 06 '22 17:09 jgm

Sounds like a reasonable idea.

As for explicitly defined anchor IDs, unfortunately I can't think of a backwards-compatible solution either. If the generated prefix depended only on the name of the file in which the user-defined anchor ID appears, the result would be a new but predictable composed anchor ID. The problem is that this is not possible: to compile files from different directories with Pandoc, the prefix must include the file path, and that depends on the machine. The problem can be reduced by determining a common root directory for all source files and deriving the prefix from the path relative to it. This would also shorten the prefixes, which makes sense anyway.
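
A small Python sketch of the common-root idea, assuming POSIX-style absolute paths; `prefixes_relative_to_common_root` and the directory names are made up for the example:

```python
import posixpath
import re


def prefixes_relative_to_common_root(sources):
    """Map each source path to an id prefix that ignores the shared root."""
    dirs = [posixpath.dirname(posixpath.normpath(s)) or "." for s in sources]
    root = posixpath.commonpath(dirs) or "."
    prefixes = {}
    for source in sources:
        relative = posixpath.relpath(posixpath.normpath(source), root)
        prefixes[source] = re.sub(r"[^A-Za-z0-9]+", "-", relative).strip("-").lower()
    return prefixes


# The same tree checked out in two different places yields the same prefixes.
print(prefixes_relative_to_common_root(
    ["/home/alice/docs/guide/intro.md", "/home/alice/docs/api/index.md"]))
# {'/home/alice/docs/guide/intro.md': 'guide-intro-md', '/home/alice/docs/api/index.md': 'api-index-md'}
print(prefixes_relative_to_common_root(
    ["/srv/build/docs/guide/intro.md", "/srv/build/docs/api/index.md"]))
# {'/srv/build/docs/guide/intro.md': 'guide-intro-md', '/srv/build/docs/api/index.md': 'api-index-md'}
```

The prefixes stay stable across machines as long as the layout below the common root does not change.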

External links to such anchors in the generated document must be updated accordingly, which is a breaking change of Pandoc.

If the user changes his/her directory structure so that a new common base directory results, links to anchors must be updated again. It is then the user's responsibility to avoid such a change if they wish to do so.

However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users.

bebuch avatar Sep 06 '22 19:09 bebuch

However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users.

It's always very hard to know. Usually when I make a change like this, I find that all sorts of people have been relying on the old behavior!

jgm avatar Sep 06 '22 19:09 jgm

I think the fact that this doesn't work limits pandoc to small products. You can't create anything big that is still maintainable. Now, I don't think there is an issue with the links. For one file, the links are internal, no difference. Once you have more than one file, the link derives from the common folder structure (with the root removed). This behavior should make it consistent when you only have one file as well, so not sure where the breakage would occur, as this doesn't work properly for multi files at the moment.

mbrucher avatar Sep 06 '22 19:09 mbrucher

True, you can't know. The only way to keep backward compatibility would be to introduce a new command line option for the changed behavior. Indeed, that might be the best solution. Maybe something like '--file-prefixed-anchors'.

bebuch avatar Sep 06 '22 19:09 bebuch