hakyll Add Pandoc Meta data to Context

One interesting feature of ReST is that you can define document variables and meta data directly inside the document, without having to rely on some additional YAML header.

For example, the title of the document and its optional subtitle can be inferred from the first headers of the file (see Markup Specification (Document Structure)), and if the first non-comment element is a field list, its fields update the bibliographic information of the document (see Markup Specification (Bibliographic Fields)).

Although the ReST Parser of Pandoc is far from perfect (it does not support custom directives or roles, and the Pandoc AST is quite restrictive), it does implement the aforementioned features, in standalone mode, and populates the Meta information of the Pandoc document.

Document Title
==============

Subtitle
--------

:author: flupe

However, it seems as though the Pandoc compilers provided by Hakyll completely ignore the meta information of the parsed Pandoc documents.

I don't really know if other markup languages supported by Pandoc also populate the meta information, but I do think it would be useful to provide an easier way to inject this meta information into Hakyll contexts.

For a custom site I've just set up, I have something that more or less works. Here is the relevant part (source):

 match "posts/*" $ do
    route $ setExtension "html"

    let ropts = defaultHakyllReaderOptions { readerStandalone = True }
        wopts = defaultHakyllWriterOptions

    compile $ do
        document <- getResourceBody >>= readPandocWith ropts

        let
            Pandoc meta _ = itemBody document

            inlinesToString :: [Inline] -> String
            inlinesToString inlines =
                concatMap inlineToString inlines
                where
                    inlineToString (Str a) = a
                    inlineToString Space   = " "
                    -- Ignore every other inline instead of crashing on it.
                    inlineToString _       = ""

            extractMeta :: String -> MetaValue -> Context a
            extractMeta name metavalue =
                case metavalue of
                    MetaInlines inlines -> mkField $ inlinesToString inlines
                    _ -> mempty
                where
                    mkField = field name . const . return

            ctx :: Context String
            ctx = foldMapWithKey extractMeta (unMeta meta)
                <> postCtx

        writePandocWith wopts document
            &   loadAndApplyTemplate "templates/post.html" ctx
            >>= loadAndApplyTemplate "templates/default.html" ctx
            >>= relativizeUrls

Essentially, we:

Parse the post and get an Item Pandoc
Create a new context ctx populated with the metadata from the document.
Render the post with templates using the new context we created.
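The extraction step above can also be sketched as a pure function, which makes it easier to try out in GHCi. This is only a sketch, assuming a recent pandoc-types where metadata keys and strings are Text (as in the type listing later in this thread). The names plain, metaValueText and metaFields are made up for illustration, and dropping false booleans is an assumption based on Hakyll's $if(...)$ testing whether a field exists.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as M
import           Data.Text (Text)
import qualified Data.Text as T
import           Text.Pandoc.Definition
import           Text.Pandoc.Walk (query)

-- Plain-text contribution of a single inline node. Containers such as
-- Emph contribute nothing themselves; 'query' visits their children.
plain :: Inline -> Text
plain (Str t)    = t
plain Space      = " "
plain SoftBreak  = " "
plain LineBreak  = " "
plain (Code _ t) = t
plain (Math _ t) = t
plain _          = ""

-- Render a metadata value as text, dropping all formatting.
-- Lists and maps are skipped because they have no obvious flat form.
metaValueText :: MetaValue -> Maybe Text
metaValueText (MetaString t)   = Just t
metaValueText (MetaBool True)  = Just "true"
metaValueText (MetaBool False) = Nothing  -- assumption: absent field means false
metaValueText (MetaInlines is) = Just (query plain is)
metaValueText (MetaBlocks bs)  = Just (query plain bs)  -- blocks are joined without separators
metaValueText _                = Nothing

-- All metadata entries that have a flat text form, as key/value pairs.
metaFields :: Meta -> [(String, String)]
metaFields (Meta m) =
  [ (T.unpack k, T.unpack t)
  | (k, v) <- M.toList m
  , Just t <- [metaValueText v]
  ]
```

Inside the compiler above, the result could then be turned into a context with something like foldMap (uncurry constField) (metaFields meta) <> postCtx. Note that this loses title formatting, since every inline is flattened to plain text.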

However, this mechanism in ReST implies that you actually need to parse the document to get the context, and I have no idea how to make this work well with other rules such as creating an index.html or archives.html page (without having to parse again for each rule).

In conclusion, here's what I am suggesting:

Adding a proper way to transform Pandoc Meta into a Hakyll Context.
Adding a way to associate this information with posts inside snapshots?

Jun 15 '18 16:06 flupe

+1.

In my case, I have a Pandoc filter that counts the number of words in blog posts and deduces the time-to-read. This information is then stored in the Pandoc metadata. I would like to tap into this information via a template.

I'm willing to draft a pull request if someone can help me understand the steps required.

May 15 '19 01:05 LaurentRDC

I have an idea which I'm going to try: extend the Provider type by adding a new field, providerMetadata :: FilePath -> IO Metadata (or something like this). That way I will be able to use any routine to extract any metadata, and the result can be merged with the existing metadata.
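To make the shape of this idea concrete, here is a minimal sketch. It does not use Hakyll's Provider or Metadata types: FlatMetadata, MetadataProvider and withExtraMetadata are hypothetical names, a Map String String stands in for the real metadata type, and letting the custom provider win on duplicate keys is an assumption, since the merge order is not specified above.

```haskell
import qualified Data.Map as M

-- Hypothetical stand-in for Hakyll's Metadata type, used only to keep
-- this sketch self-contained.
type FlatMetadata = M.Map String String

-- A routine that produces metadata for a given source file.
type MetadataProvider = FilePath -> IO FlatMetadata

-- Run a custom provider next to the existing one and merge the results.
-- 'M.union' is left-biased, so the custom provider wins on key clashes
-- (an assumption, see above).
withExtraMetadata :: MetadataProvider -> MetadataProvider -> MetadataProvider
withExtraMetadata extra base path = do
  fromExtra <- extra path
  fromBase  <- base path
  pure (M.union fromExtra fromBase)
```

A provider built this way still has to read or parse the file by itself whenever it is called.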

Jun 03 '20 19:06 ip1981

Thank you for your work!

However I don't think it is the right solution for this specific issue just yet. The thing is, because the document metadata is accessible from pandoc once the entire document has been parsed, you would ideally parse the file once, and make hakyll use both the document metadata and the parsed content from there on. With your current solution, while it is true that you can specify a custom metadata provider, it still relies on parsing the document from scratch again and again. That is very slow.

A more appropriate solution would be to improve how files are loaded into the store (See load here: https://hackage.haskell.org/package/hakyll-4.13.3.0/docs/src/Hakyll.Core.Provider.MetadataCache.html#resourceMetadata) so that for pandoc documents, we can retrieve there the metadata and store some Item Pandoc instead of Item String (or whatever the internal pandoc representation type is called).

Given the lack of response from hakyll's maintainer @jaspervdj, I did not start working on this, as I had very little hope that such a change would be merged, and I thought that designing a proper solution required some more discussion. I might look at it in the future if I can find some time. Please tell me if I got your PR wrong.

Jun 04 '20 10:06 flupe

Meh, Pandoc metadata is not trivial (I personally do not want to lose title formatting - from LaTeX), someone may use Hakyll without Pandoc, etc. After all I want a universal solution. Thus just a simple FilePath -> IO Metadata. Additionally, one may want to build metadata from the resource body and not rely on Pandoc's metadata.

Jun 04 '20 10:06 ip1981

Let me reformulate. I have no doubt your PR is useful and would love to see it merged. What I'm arguing is that it does not resolve this issue, hence I'd rather you did not put "Closes #643" in the PR comment.

As for providing only a generic solution and not an additional one for pandoc documents, I don't believe the argument that "someone may use Hakyll without Pandoc" to be sufficient. Hakyll is very much made to work well with pandoc (pandoc compilers would not have been included otherwise), and I think there is value to optionally ease the handling of pandoc metadata. The upvotes this issue received suggest other people are interested as well. The performance concern due to parsing at least twice every document still stands, and if we can do better by being less generic then so be it.

Jun 04 '20 11:06 flupe

Closing this now. After @ip1981's comment and PR more than a week ago I started investigating whether it could be used as a starting point for solving this issue. Still, I was just trying to work against every abstraction Hakyll is using.

All in all Hakyll was simply not a generic enough tool for what I wanted (that's not bad per se!). Ended up making my own library which you can hear more about here. It's very tiny and does everything I want. Closing the issue.

Jun 15 '20 14:06 flupe

Still, I was just trying to work against every abstraction Hakyll is using.

@flupe, do you mean that you tried implementing that solution you suggested, which would make Hakyll work with an Item Pandoc instead of an Item String?

Jul 31 '20 16:07 gnull

I think this issue should be unclosed. I just ran into a similar problem where I assumed that the Pandoc ASTs being passed into my transformation pipeline would have their YAML metadata, because why wouldn't they? I coded up a complete solution to my problem before running it and discovering that no, Hakyll strips all of the Pandoc metadata (why???) and I had to come up with an entirely different Hakyll approach. Saying that there is some other non-Hakyll library which does it differently is in no way a solution which closes this problem! (By that logic, you could close every Hakyll issue because there is presumably at least one tool out there which in some way doesn't have that issue...)

Aug 29 '22 18:08 gwern

@gwern Can you post a summary of your instance of the problem, so I can poke around and understand what pieces are involved? Then we can discuss possible solutions.

Aug 29 '22 18:08 Minoru

https://github.com/gwern/gwern.net/commit/9542a9a4693d0e3f34bc79f420b37fe0066305c6

It's fairly straightforward: I use a pandocTransformWith to run a bunch of Pandoc API transformations. 'Index' pages (pages which have index: true set in the YAML metadata, e.g. the neural net video generation bibliography) are 'simpler' and the HTML template disables a bunch of stuff, so I thought I would disable several of the transformations as well, because they are slow & cause some bugs. (This obviously can't be done at the final templating pass, because the HTML template is generated long after all of this has run; it has to be done inside the previous Compiler stage, in Hakyll-land.) So, since index: true is available in the Pandoc document type Pandoc, stored in the Data.Map.Map bundled with the actual [Block], and you just extract that map with unMeta from the document's Meta and then look up the boolean variable, I thought I'd simply augment my transform pipeline with a quick lookup of index and then toggle the expensive transforms based on that. This is logical, typechecks, and runs perfectly. It's just that Hakyll erases the Pandoc metadata and you wind up with an empty metadata map, with all the original YAML values erased including index, and so the index-check never gets set to True and the expensive passes always run...

My preferred solution would be for Hakyll to simply not erase the original Pandoc metadata. Does anyone expect it to do that? You'd expect it to read a Pandoc as specified from the files specified, and it to not molest the Pandoc to erase the metadata or whatever is going on behind the scenes there.

A quick reminder of the relevant Pandoc types:

ghci> :i Pandoc
type Pandoc :: *
data Pandoc = Pandoc !Meta ![Block]
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Monoid Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Ord Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Semigroup Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Show Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Read Pandoc -- Defined in ‘Text.Pandoc.Definition’
ghci> :i Meta
type Meta :: *
newtype Meta
  = Meta {unMeta :: Data.Map.Internal.Map
                      Data.Text.Internal.Text MetaValue}
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq Meta -- Defined in ‘Text.Pandoc.Definition’
instance Monoid Meta -- Defined in ‘Text.Pandoc.Definition’
instance Ord Meta -- Defined in ‘Text.Pandoc.Definition’
instance Semigroup Meta -- Defined in ‘Text.Pandoc.Definition’
instance Show Meta -- Defined in ‘Text.Pandoc.Definition’
instance Read Meta -- Defined in ‘Text.Pandoc.Definition’
ghci> :i MetaValue
type MetaValue :: *
data MetaValue
  = MetaMap !(Data.Map.Internal.Map
                Data.Text.Internal.Text MetaValue)
  | MetaList ![MetaValue]
  | MetaBool !Bool
  | MetaString !Data.Text.Internal.Text
  | MetaInlines ![Inline]
  | MetaBlocks ![Block]
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq MetaValue -- Defined in ‘Text.Pandoc.Definition’
instance Ord MetaValue -- Defined in ‘Text.Pandoc.Definition’
instance Show MetaValue -- Defined in ‘Text.Pandoc.Definition’
instance Read MetaValue -- Defined in ‘Text.Pandoc.Definition’
ghci> :i Block
type Block :: *
data Block
  = Plain ![Inline]
  | Para ![Inline]
  | LineBlock ![[Inline]]
  | CodeBlock !Attr !Data.Text.Internal.Text
  | RawBlock !Format !Data.Text.Internal.Text
  | BlockQuote ![Block]
  | OrderedList !ListAttributes ![[Block]]
  | BulletList ![[Block]]
  | DefinitionList ![([Inline], [[Block]])]
  | Header {-# UNPACK #-}Int !Attr ![Inline]
  | HorizontalRule
  | Table !Attr
          !Caption
          ![ColSpec]
          !TableHead
          ![TableBody]
          !TableFoot
  | Div !Attr ![Block]
  | Null
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq Block -- Defined in ‘Text.Pandoc.Definition’
instance Ord Block -- Defined in ‘Text.Pandoc.Definition’
instance Show Block -- Defined in ‘Text.Pandoc.Definition’
instance Read Block -- Defined in ‘Text.Pandoc.Definition’
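Given these types, the lookup described above can be sketched as follows. This is only an illustration: isIndexPage, transformPage and exampleIndexPage are hypothetical names, the cheap and expensive passes are passed in as plain functions, and it assumes that a YAML index: true reaches the AST as MetaBool True. As reported above, inside Hakyll's Pandoc compilers the metadata arrives empty, so there the check would always return False.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as M
import           Text.Pandoc.Definition

-- True only when the document metadata contains a boolean 'index: true'.
isIndexPage :: Pandoc -> Bool
isIndexPage (Pandoc meta _) =
  case lookupMeta "index" meta of
    Just (MetaBool b) -> b
    _                 -> False

-- Always run the cheap passes; skip the expensive ones on index pages.
transformPage :: (Pandoc -> Pandoc) -> (Pandoc -> Pandoc) -> Pandoc -> Pandoc
transformPage cheap expensive doc
  | isIndexPage doc = cheap doc
  | otherwise       = expensive (cheap doc)

-- A small document carrying the flag, for trying this out in GHCi.
exampleIndexPage :: Pandoc
exampleIndexPage =
  Pandoc (Meta (M.fromList [("index", MetaBool True)])) [Para [Str "links"]]
```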

Aug 29 '22 20:08 gwern
