obsidian-pandoc

Wiki-links not removed when exporting from markdown

Open maybemkl opened this issue 2 years ago • 16 comments

When I change the setting "Export files from HTML to Markdown" from HTML to Markdown, all the functionality for removing [[wiki-links]] formatting in the output PDF stops working.

maybemkl avatar Sep 06 '21 21:09 maybemkl

Hi,

This is expected behaviour. When you change this setting, you are choosing between Obsidian's markdown features and Pandoc's markdown features. If you want Pandoc citations you have to give up Obsidian wiki-links, and vice versa. There's no easy way to get the best of both worlds unfortunately.

I realise this setting is poorly worded, so I will probably change it in the future.

OliverBalfour avatar Sep 07 '21 05:09 OliverBalfour

Hi, thanks for responding. I understand the issue, but I was thinking it might still make sense to work around it. I wrote this very barebones Python filter that removes the brackets upon export, which works as a patchy solution for now: https://gist.github.com/maybemkl/d9be15bcabadaa19d2ca50c87b59a92e

maybemkl avatar Sep 07 '21 05:09 maybemkl

Yes, that's a fair point - this is one case where it's easy to fix formatting, but I can't fix formatting in general. I'll take a look at this in the next release

OliverBalfour avatar Sep 07 '21 05:09 OliverBalfour

Hello @OliverBalfour Thanks for your great plugin! I may have a workaround for having to choose between citations (using Markdown) or everything else (using html). I managed to get both by using pandoc twice:

  1. Using your plugin from html to convert Obsidian markdown to pandoc markdown, which does most of the heavy lifting, like transclusions, images, etc;
  2. Feeding the resulting markdown to sed, to switch the resulting escaped [, ], and @ to non-escaped [, ], and @;
  3. Feeding the resulting markdown to pandoc again, in my case with the zotero lua filter to obtain a Word file that has both Obsidian features like images and transclusions, and Zotero live citations.

Here is what my command line looks like:

cat EXPORTED_MARKDOWN_FILE.pandoc.md | sed 's/\\\[/\[/g; s/\\\]/\]/g; s/\\\@/\@/g' | pandoc -s --lua-filter zotero.lua -o DESTINATION_FILE.docx

(obviously you have to download the zotero.lua file for it to work)
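
For anyone who would rather not rely on sed, here is a minimal Python sketch of the same un-escaping step. It is only an illustration of the idea above, not part of the plugin: the function name `unescape_markdown` and its `chars` parameter are made up for this example, and it assumes the only thing to undo is a single backslash in front of each listed character.

```python
import re


def unescape_markdown(text: str, chars: str = "[]@") -> str:
    """Remove the backslash in front of every character listed in `chars`.

    Hypothetical helper mirroring the sed expressions above; `chars`
    defaults to the three characters un-escaped in that command.
    """
    # Build a pattern such as \\([\[\]@]) : a literal backslash followed
    # by one of the characters we want to un-escape.
    pattern = re.compile(r"\\([" + re.escape(chars) + r"])")
    # Keep the captured character and drop the backslash.
    return pattern.sub(r"\1", text)
```

Calling `unescape_markdown(exported_text)` on the contents of the exported `.pandoc.md` file and piping the result to pandoc should be equivalent to the sed step. Passing a longer `chars` string un-escapes additional characters, with the same risk of breaking text that was escaped on purpose.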

Here is what it looks like on a sample page:

Markdown page in Obsidian, with both Obsidian stuff and citations:

image

Resulting docx file:

image

I did not test it extensively yet, but for now it seems to work very well. I know it's kinda hacky but I've seen much hackier code before. If it can give you ideas.

Thanks again,

Felix

felixchenier avatar Oct 28 '21 12:10 felixchenier

Hi @felixchenier, your temporary proposal looks great, but I would be losing footnotes. Any idea for that? I'm also looking to build a filter for Obsidian's ==highlight== syntax: any idea on how to build a Python (or other) filter that would end up with highlighted text in Word? Thanks!

Limezy avatar Nov 19 '21 03:11 Limezy

Hi @Limezy, unfortunately this is only a workaround. I haven't played with all that filter stuff, and I'm not sure how my proposal could also apply to footnotes since I didn't try it. If the problem is the presence of an escaped hat (\^) in the output pandoc file, then you could change the sed command in my example to also remove escapes for hats:

cat EXPORTED_MARKDOWN_FILE.pandoc.md | sed 's/\\\[/\[/g; s/\\\]/\]/g; s/\\\@/\@/g; s/\\\^/\^/g' | pandoc -s --lua-filter zotero.lua -o DESTINATION_FILE.docx

But I didn't try it. And for highlights, it may be the same thing but for escaped equals signs:

cat EXPORTED_MARKDOWN_FILE.pandoc.md | sed 's/\\\[/\[/g; s/\\\]/\]/g; s/\\\@/\@/g; s/\\\^/\^/g; s/\\\=/\=/g' | pandoc -s --lua-filter zotero.lua -o DESTINATION_FILE.docx

But I also didn't try it. It may (will) break something, somewhere, at some moment, that's for sure!

Good luck

felixchenier avatar Nov 22 '21 20:11 felixchenier

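A usable Lua filter:

function Str (str)
    -- Strip the [[ and ]] wikilink delimiters from every Str element.
    -- Assigning to a single local keeps only gsub's first result (the string).
    local text = str.text:gsub('%[%[', ''):gsub('%]%]', '')
    return pandoc.Str(text)
end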

wenbopeng avatar Sep 16 '22 01:09 wenbopeng

Wow, that's great @wenbopeng, it works perfectly! I'll try to adapt it using a regex to parse the [[xxx#yyy]] format.

Limezy avatar Oct 03 '22 17:10 Limezy

For whoever is interested, my current filter:

function Str (str)
    return (str.text
    :gsub('%[%[', '')
    :gsub('%]%]', ''))
 end

Now only waiting for the plugin to manage transclusions!

Limezy avatar Oct 07 '22 11:10 Limezy

@Limezy, your previous snippet is the same as wenbopeng's, and you wanted to adapt it to handle the [[xxx#yyy]] format. Do you have another snippet that does that, by any chance?
Actually, I am looking for some inspiration on how to handle [[xxx^section|alias]] -> alias filtering.

bvorak avatar Apr 20 '23 07:04 bvorak

@bvorak and @Limezy I'm also looking for a filter of [[xxx^section|alias]] -> alias and [[xxx#yyy]]. Did you succeed?

jankap avatar Jun 19 '23 23:06 jankap

> @bvorak and @Limezy I'm also looking for a filter of [[xxx^section|alias]] -> alias and [[xxx#yyy]]. Did you succeed?

https://regex101.com/r/jCiF1r/1 seems to solve most cases for me as of now. But there must be a more accessible way no? :D What it does not handle are links without an alias but with a section qualifier like [[xxx#section]].

# whitespace has to be ignored by the editor
(?|
\[\[(?:.[^\|\]]*)\|(.[^\]]*)\]\]
|
\[\[(.[^\|\#\]]*)\]\]
)
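
One possible way to also cover links that carry a section qualifier but no alias, sketched in Python rather than PCRE. Python's `re` module has no branch-reset group `(?|...)`, so this version uses named groups instead. The names `WIKILINK` and `alias_or_target` are made up for illustration, and dropping the qualifier entirely (so that [[xxx#section]] becomes xxx) is an assumption about the desired output.

```python
import re

# Hypothetical pattern: note name, optional #heading / ^block qualifier,
# optional |alias, all between double square brackets.
WIKILINK = re.compile(
    r"""
    \[\[
    (?P<target>[^\]|\#^]+)       # note name, up to the first #, ^, | or ]
    (?:[\#^][^\]|]*)?            # optional qualifier, matched but discarded
    (?:\|(?P<alias>[^\]]+))?     # optional alias after the pipe
    \]\]
    """,
    re.VERBOSE,
)


def alias_or_target(text: str) -> str:
    """Replace each wikilink with its alias, or with the note name if there is none."""
    return WIKILINK.sub(lambda m: m.group("alias") or m.group("target"), text)
```

With this sketch, `alias_or_target("[[xxx^section|alias]]")` gives `alias` and `alias_or_target("[[xxx#yyy]]")` gives `xxx`. Embeds such as ![[image.png]] are not treated specially: only the bracketed part is rewritten and the leading ! stays in place.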

bvorak avatar Jun 20 '23 08:06 bvorak

@bvorak @jankap my current Lua filter is now a bit crazy. It will detect the [[A#B#C|D]] syntax (because I also have a special [[A#B#C]] plugin), but you can use it as is. I still have one buggy case: the [[wikilink]].[[wikilink]] syntax, with only a dot and no space between two links.

It will replace "[[ANYTHING|D]]" by "D"
It will replace "[[A]]" by "A"
It will replace "[[A#B]]" by "A, B"
It will replace "[[A#B#C]]" by "A, B-C"
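
To make these four rules concrete, here is a rough Python sketch of the text rewriting alone. It is separate from the Lua filters in this post and does not reproduce them: it works on a plain string rather than on the pandoc AST, it does no LaTeX conversion, and the names `render_wikilink` and `strip_wikilinks` are invented for the example. How links with more than two # marks are joined is an assumption modelled on the section-joining logic in filter one.

```python
import re

# Anything between [[ and ]] that contains no square brackets itself.
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def render_wikilink(body: str) -> str:
    """Turn the inside of one wikilink into its display text."""
    # Rule 1: an alias after the pipe wins over everything else.
    if "|" in body:
        return body.split("|", 1)[1]
    parts = [p for p in body.split("#") if p]
    if not parts:
        return body
    if len(parts) == 1:          # Rule 2: [[A]] -> A
        return parts[0]
    if len(parts) == 2:          # Rule 3: [[A#B]] -> A, B
        return ", ".join(parts)
    # Rule 4: [[A#B#C]] -> A, B-C (last two parts joined by a dash).
    head = ", ".join(parts[:-2])
    return f"{head}, {parts[-2]}-{parts[-1]}"


def strip_wikilinks(text: str) -> str:
    """Apply render_wikilink to every wikilink found in `text`."""
    return WIKILINK.sub(lambda m: render_wikilink(m.group(1)), text)
```

Because the regular expression runs over the raw string, two links separated only by a dot are each matched on their own in this sketch.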

"A", "B", "C" or "D" markdown markups will be converted to LaTeX equivalent markup. For example, "[[A#B#C|This is a bold word]]" will be converted to "This is a \emph{bold} word".

You can probably easily change these behaviours using my example as a starting point. ChatGPT may help you get a sense of what's going on. It's not one, but two filters that you have to run one after the other:

Filter one

--[[
Add support for a custom inline syntax.
This pandoc Lua filter allows to add a custom markup syntax
extension. It is designed to be adjustable; it should not be
necessary to modify the code below the separator line.
The example here allows to add highlighted text by enclosing the
text with `==` on each side. Pandoc supports this for HTML output
out of the box. Other outputs will need additional filters.
Copyright: © 2022 Albert Krewinkel
License: MIT
]]

-- Lua pattern matching the opening markup string.
local opening = "%[%["

-- Lua pattern matching the closing markup string.
local closing = "%]%]"

-- Toggle whether the opening markup may be followed by whitespace.
local nospace = true

-- Function converting the enclosed inlines to their internal pandoc
-- representation.
local function markup_inlines(inlines)
  local result = pandoc.Span(inlines)
  result.attr = { class = "wikiLink" }
  return result
end

------------------------------------------------------------------------

local function is_space(inline)
  return inline and
    (inline.t == "Space" or
      inline.t == "LineBreak" or
      inline.t == "SoftBreak" or
      (inline.t == "Str" and inline.text:match("^%s*$")))
end

function Inlines(inlines)
  local result = pandoc.Inlines{}
  local markup = nil
  local start = nil
  for i, inline in ipairs(inlines) do
    if inline.tag == "Str" then
      if not markup then
        local before, first = inline.text:match("(.-)" .. opening .. "(.*)")
        if first then
          start = inline -- keep element around in case the
          -- markup is not closed. Check if the
          -- closing pattern is already in this
          -- string.
          local selfclosing, after = first:match("(.-)" .. closing .. "(.*)")
          if selfclosing then
            result:insert(pandoc.Str(before))
            result:insert(markup_inlines{ pandoc.Str(selfclosing) })
            result:insert(pandoc.Str(after))
          elseif nospace and first == "" and is_space(inlines[i + 1]) then
            -- the opening pattern is followed by a space, but the
            -- config disallows this.
            result:insert(inline)
          else
            local target = first
            local pipe = target:find("|")
            local hashes = {}
            for hash in target:gmatch("#") do
              table.insert(hashes, hash)
            end
            local hashCount = #hashes

            if pipe then
              target = target:sub(pipe + 1)
            elseif hashCount > 0 then
              local sections = {}
              local sectionCount = hashCount + 1
              for section in target:gmatch("[^#]+") do
                table.insert(sections, section)
              end
              if sectionCount == 2 then
                target = table.concat(sections, ", ")
              else
                local firstSection = table.concat(sections, ", ", 1, sectionCount - 2)
                local lastSection = sections[sectionCount - 1] .. "-" .. sections[sectionCount]
                target = firstSection .. ", " .. lastSection
              end
            end

            result:insert(pandoc.Str(before))
            markup = pandoc.Inlines{ pandoc.Str(target) }
          end
        else
          result:insert(inline)
        end
      else
        local last, after = inline.text:match("(.-)" .. closing .. "(.*)")
        if last then
          markup:insert(pandoc.Str(last))
          result:insert(markup_inlines(markup))
          markup = nil
          result:insert(pandoc.Str(after))
        else
          markup:insert(inline)
        end
      end
    else
      local acc = markup or result
      acc:insert(inline)
    end
  end

  -- keep unterminated markup
  if markup then
    markup:remove(1) -- the stripped-down first element
    result:insert(start)
    result:extend(markup)
  end
  return result
end

Filter two

function replaceHashMarks(text)
  local hashCount = select(2, text:gsub("#", ""))
  if hashCount == 1 then
    return text:gsub("#", ", ")
  elseif hashCount >= 2 then
    return text:gsub("#", ", ", 1):gsub("#", "-", 1)
  else
    return text
  end
end

function stringifyInline(inline)
  if inline.t == "Str" then
    return inline.text
  elseif inline.t == "Emph" then
    return "*" .. stringifyWithMarkup(inline.content) .. "*"
  elseif inline.t == "Strong" then
    return "**" .. stringifyWithMarkup(inline.content) .. "**"
  elseif inline.t == "Code" then
    return "`" .. inline.text .. "`"
  elseif inline.t == "Link" then
    -- inline.target is already a plain string, so it is concatenated directly
    return "[" .. stringifyWithMarkup(inline.content) .. "](" .. inline.target .. ")"
  elseif inline.t == "Image" then
    -- an Image keeps its inlines in `caption`; `src` is a plain string
    return "![" .. stringifyWithMarkup(inline.caption) .. "](" .. inline.src .. ")"
  elseif inline.t == "Space" then
    return " "
  elseif inline.t == "SoftBreak" or inline.t == "LineBreak" then
    return "\n"
  elseif inline.t == "Subscript" then
    return "~" .. stringifyWithMarkup(inline.content) .. "~"
  elseif inline.t == "Superscript" then
    return "^" .. stringifyWithMarkup(inline.content) .. "^"
  elseif inline.t == "Strikethrough" then
    return "~~" .. stringifyWithMarkup(inline.content) .. "~~"
  else
    local parts = {}
    for _, elem in ipairs(inline.content) do
      table.insert(parts, stringifyInline(elem))
    end
    return table.concat(parts)
  end
end

function stringifyWithMarkup(content)
  local output = ""

  for _, inline in ipairs(content) do
    local inlineString = stringifyInline(inline)

    if inline.t == "Link" and #inline.content == 1 and inline.content[1].t == "Str" then
      -- Handle wikilinks enclosed within parentheses
      local linkText = inline.content[1].text
      if linkText:sub(1, 1) == "(" and linkText:sub(-1) == ")" then
        inlineString = "(" .. inlineString .. ")"
      end
    end

    output = output .. inlineString
  end

  -- Remove curly braces
  output = output:gsub("{", ""):gsub("}", "")

  return output
end

function Span(span)
  if span.classes:includes('wikiLink') then
    local content = stringifyWithMarkup(span.content)
    local modifiedContent = replaceHashMarks(content)
    modifiedContent = modifiedContent:gsub('%[%[', ''):gsub('%]%]', '')

    local pipeIndex = modifiedContent:find('|')
    if pipeIndex then
      modifiedContent = modifiedContent:sub(pipeIndex + 1)
    end

    -- Convert the modified content to LaTeX
    local modifiedAst = pandoc.read(modifiedContent, "markdown-fancy_lists")
    local latexContent = pandoc.write(modifiedAst, "latex")

    -- Wrap latexContent in curly braces
    latexContent = "{" .. latexContent .. "}"

    local newAttributes = pandoc.Attr(span.identifier, span.classes:filter(function (c) return c ~= 'wikiLink' end), span.attributes)

    return pandoc.RawInline("latex", latexContent)
  end
end

return {
  { Span = Span }
}

Limezy avatar Jun 20 '23 10:06 Limezy

Credits to https://github.com/tarleb for getting me started with the wikilink syntax detection

Limezy avatar Jun 20 '23 10:06 Limezy

@Limezy how do you call it? pandoc -L /data/tools/strip_wikilinks_1.lua -L /data/tools/strip_wikilinks_2.lua ... seems not to work, there's still one bracket left.

image

Edit: should the filters also support ![[image.png]] pictures?

jankap avatar Jul 03 '23 09:07 jankap

@Limezy thanks for the terrific work on creating a full Lua solution.

However, I found Albert's script structure quite long and hard to understand. I created a simplified alternative script preserving only the first two rules:

  • replace [[ANYTHING|D]] with D
  • replace [[A]] by A

I'm exploiting the fact that in Obsidian the link label can't have formatting and the only elements between the brackets are strings and spaces. Moreover, multiple spaces count as one, so you can suppress inline elements by replacing them with spaces.

I added some logic to remove the section anchors (^abc) that would otherwise be rendered as text. Here are some caveats:

  • Pandoc automatically absorbs the backslash in \^, so there is no way to differentiate between Obsidian anchors and similar verbatim strings.
  • similarly, it drops the ! before images and transclusions (but not with gfm), so the script cannot differentiate between them and wikilinks
  • you need to disable the implicit_header_references extension (-implicit_header_references); otherwise, in the unfortunate case where a wikilink is the same as a section title, it will be linked instead of being rendered as plain text

Here's the script:

-- wikilinks.lua

-- remove wikilinks identifiers and replace them with the link text
function clean (text)
    return text
      :gsub("%[%[([^|]-)%]%]", "%1")   -- remove simple wikilinks
      :gsub("%[%[.-|(.-)%]%]", "%1")   -- remove wikilinks with custom text
 end

function Blocks(blocks)
  for _,elem in pairs(blocks) do
    if elem.t == "Para" then
      local start = nil
      for i, inline in ipairs(elem.content) do
        if inline.tag == "Str" then
          -- remove block identifiers such as ^abc
          inline.text = inline.text:gsub("^%^%w+", "")
          -- Pandoc always parses the escapes, so there is no way to tell
          -- ^ and \^ apart
        end

        -- select the range corresponding to a wikilink and substitute it
        if inline.tag == "Str" and inline.text:match("%[%[") then
          start = i
        end
        if inline.tag == "Str" and inline.text:match("%]%]") and start then
          local result = elem.content[start].text
          for j = start+1, i do
            if elem.content[j].tag == "Str" then
              result = result .. elem.content[j].text
            else -- if it's not a string, it's a Space
              result = result .. " "
            end
            elem.content[j] = pandoc.Space()
          end
          elem.content[start].text = clean(result)
          start = nil
        end
      end
    end
  end
  return blocks
end

Banus avatar Jul 21 '23 01:07 Banus