hakyll Externally Specified FileType for Pandoc

My project is a proof-of-concept migrating an existing site with various static content, PHP code and MediaWiki pages to a Hakyll-generated static site. We have a tool that can do automatic extraction of the MediaWiki content (in source markup form) to files, which we dump under /wiki/Main_Page, /wiki/TLUG:MemberGuide, etc. It would be convenient to work directly from these names rather than having to add extensions to the files, particularly since the Items will go through some extra processing (to deal with template substitution) before being rendered from MediaWiki markup to HTML.

So what I'd like to be able to do is, at some stage in the processing, attach a FileType of MediaWiki to the Item, but there doesn't seem to be a way to do that. readPandocWith has a hard-coded call to itemFileType that looks at the extension of the original source file to determine the type of the item.

It feels to me as if, if the type of the Item is going to be based on the filename, that should be attached at the time the file is read, rather than later on, and it should be able to be changed during the processing chain if stages convert the type. This would also open up the ability to use typed sources other than files: for example, one might want to grab source code for MediaWiki pages directly from requests to the MediaWiki API at build time, rather than making copies of that content on disk, in which case the content type would come back with the HTTP request or similar, and there would be no source filename (though there might be a source URL or similar).

Does this make sense?

Mar 30 '19 06:03 0cjs

I've created a little workaround for you. This is the simplest thing I could think of which essentially "spoofs" a different file extension.

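{-# LANGUAGE OverloadedStrings #-}
import Hakyll
import System.FilePath (replaceExtension)

main :: IO ()
main = hakyll $ do

  -- Load the raw source under its real name. There is no route,
  -- so nothing is written to the output for these items.
  match "*.wrongextension" $ compile getResourceBody

  ids <- getMatches "*.wrongextension"

  -- Create a twin item for each source whose identifier ends in
  -- .mediawiki, so that renderPandoc picks the MediaWiki reader.
  create (fromFilePath . flip replaceExtension "mediawiki" . toFilePath <$> ids) $ do
    route $ setExtension "html"
    compile $ do
      -- Map the spoofed identifier back to the original and load its body.
      contents <- load =<< (fromFilePath . flip replaceExtension "wrongextension" . toFilePath <$> getUnderlying)
      -- Assumes "templates/default.html" is compiled by another rule (not shown).
      makeItem (itemBody contents)
        >>= renderPandoc
        >>= loadAndApplyTemplate "templates/default.html" defaultContext
        >>= relativizeUrls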

I'm wondering if the best way to solve this problem for hakyll would be to make some kind of function spoofFilePath :: (FilePath -> FilePath) -> Rules () -> Rules () which can be used like so:

match "*.wrongextension" $ spoofFilePath (flip replaceExtension "md") $ do
  ...

Jun 05 '19 15:06 Radvendii

I feel fairly strongly that tying the content type to the file path and/or name is not the right thing to do. After all, there are plenty of other ways one might fetch data to be used to generate the site, such as a network call (type specified by the Content-Type header when using HTTP), a database lookup (probably a MIME type from a field) or running a program and using its output.

It might also be the case that during processing the type changes, such as when a module in the compiler converts images or whatever. (In fact that already happens, but I'm not sure what happens to the type information at the moment; I suspect that it's just lost or the original, now wrong type information is maintained.)

I think that the type should be generated separately by the particular routine loading the data, and then carried along with other information, and be modifiable by anything doing further processing in the processing chain.
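
A minimal sketch of what that could look like, in plain Haskell with no Hakyll dependency. Every name here (ContentType, TypedItem, loadWithType, mapBody, convert) is hypothetical and not part of Hakyll's API, and the MIME strings are only illustrative:

```haskell
-- Hypothetical sketch: none of these names exist in Hakyll.

-- | A content type carried with the data, e.g. a MIME type string.
newtype ContentType = ContentType String
  deriving (Eq, Show)

-- | An item whose type travels with it instead of being re-derived
-- from a path or identifier at each stage.
data TypedItem a = TypedItem
  { tiSource :: String       -- provenance only (path, URL, ...); never consulted for the type
  , tiType   :: ContentType
  , tiBody   :: a
  } deriving (Eq, Show)

-- | The loader decides the type once, from whatever it knows: an
-- extension, an HTTP header, a database field, or a fixed choice.
loadWithType :: String -> ContentType -> a -> TypedItem a
loadWithType = TypedItem

-- | A stage that rewrites the body but leaves the type alone
-- (e.g. template substitution on MediaWiki markup).
mapBody :: (a -> a) -> TypedItem a -> TypedItem a
mapBody f item = item { tiBody = f (tiBody item) }

-- | A stage that converts the body and declares the new type.
convert :: ContentType -> (a -> b) -> TypedItem a -> TypedItem b
convert newType f item = item { tiType = newType, tiBody = f (tiBody item) }

mediaWiki, html :: ContentType
mediaWiki = ContentType "text/x-wiki"   -- illustrative value
html      = ContentType "text/html"

-- | Extensionless source, typed by the loader, retyped by the renderer.
example :: TypedItem String
example =
  convert html (\s -> "<p>" ++ s ++ "</p>")   -- stand-in for a real renderer
    . mapBody (filter (/= '\r'))              -- stand-in for template substitution
    $ loadWithType "wiki/Main_Page" mediaWiki "Hello\r"
```

The source name wiki/Main_Page has no extension and is never inspected; only the loader and the converting stage touch the type.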

Jun 05 '19 16:06 0cjs

I want to be clear that I'm not in opposition to separating the content type from file extension. I'm just spitballing alternative solutions that might be better for some reason (easier to implement; less disruptive to the ecosystem; etc). If it works well to solve it the way you're proposing, I'm all for it.

It's also important to note that the content type is tied to the identifier, not the file path. This is an important distinction because in the examples you're bringing up, where you would generate content not from a file but from elsewhere, there is no file path, and you would be generating an identifier, which you could then make have whatever file extension you want.

One example of this is using buildTags and tagsRules, another is RSS feeds. Those are both cases where you would be generating content not from a file but from inside the Haskell code. In both of those cases, you have an opportunity to define the file type via a file extension when you decide the identifier name (in the second argument of buildTags and the argument to create respectively).

Jun 05 '19 16:06 Radvendii

I agree with you about what happens when file type changes during compilation. For instance, let's just say you wanted to run two passes of pandoc over a file (or probably more realistically, a different compiler and then pandoc). The first conversion might change the content type, but the hakyll compilation system would be unaware of this and the second pass would fail. In fact, as the process is currently set up, it's not possible to run multiple passes of pandoc over a file, because the output type is set by the extension of the route field.

I'm not sure how often this actually comes up, but it does seem like a problem.

Jun 05 '19 16:06 Radvendii

Ah, ok! I now realize that I never mentioned that I already have a workaround (though yours may be better; I'll compare the two when I get time).

I think we're in general agreement that type should be a property of Item values and we should have the ability to change the type along with other parts of items as they move through the processing chain.

But your point about the type being tied to the identifier and not the file path seems to contradict itself when you continue with "you would be generating an identifier, which you could then make have whatever file extension you want."

I don't know if this is a typo or if you think that file extensions really are the right way to encode types, but they're not. There's a reason that we have MIME types, and that's specifically to deal with issues such as .jpg vs. .JPG vs. .jpeg and that some file extensions identify multiple different media types.

I would prefer to see that paths of any kind are completely disconnected from file type except at the point where new data are read or generated (including being generated from transforms of old data). Thus, when data are loaded into an item from source path /a/b/c.png, but it turns out to be JPEG, the code doing the loading would set the MIME type to image/jpeg and despite the source path remaining /a/b/c.png, the data would be treated correctly. More plausibly, images converted from PNG to JPEG or vice versa wouldn't have their source paths touched at all, but would simply have their MIME type changed (and perhaps the destination path modified).
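
As a rough illustration of the c.png case, here is a sketch in plain Haskell where the loader assigns the MIME type from the leading bytes and keeps the path purely as provenance. The names sniffMime, Loaded and loadBytes are hypothetical, and only the PNG and JPEG signatures are checked:

```haskell
import Data.List (isPrefixOf)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- | Guess a MIME type from the leading bytes, ignoring the path entirely.
sniffMime :: [Word8] -> Maybe String
sniffMime bytes
  | pngMagic  `isPrefixOf` bytes = Just "image/png"
  | jpegMagic `isPrefixOf` bytes = Just "image/jpeg"
  | otherwise                    = Nothing
  where
    pngMagic  = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]  -- PNG signature
    jpegMagic = [0xFF, 0xD8, 0xFF]                                -- JPEG start-of-image + marker prefix

-- | Hypothetical loader result: the path is recorded but does not
-- determine how the data are treated.
data Loaded = Loaded
  { sourcePath :: FilePath
  , mimeType   :: String
  , body       :: [Word8]
  } deriving (Eq, Show)

-- | Set the type at load time, falling back to a generic binary type
-- when the content is not recognized.
loadBytes :: FilePath -> [Word8] -> Loaded
loadBytes path bytes =
  Loaded path (fromMaybe "application/octet-stream" (sniffMime bytes)) bytes
```

With this, loading JPEG bytes from /a/b/c.png yields an item typed image/jpeg whose sourcePath is still /a/b/c.png.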

Jun 06 '19 00:06 0cjs