Feature request: symlink/symbolic link for faster/smaller compiled site versions

Open gwern opened this issue 3 years ago • 12 comments

I would like a symlinkCompiler that creates symbolic links (or hard links) as a drop-in replacement for a standard static-file copying rule such as my let static = route idRoute >> compile copyFileCompiler. This would be a performance optimization for sites that compile many large static files.

As gwern.net gets larger, particularly with audio/images/videos generated for my deep learning experiments, compiling it spends increasingly more time and disk space creating _site/. Even with an NVMe SSD, the time starts to add up; more problematically, I'm starting to run out of disk space for creating 40GB _site/ folders just to upload a few modified files & then delete them. Almost all of that disk space & IO goes to copying things like PDFs or MP4s from one folder to another. There's no particular reason those copies couldn't just be symbolic or hard links back to the original files; I could then run rsync with --copy-links so that it follows the links when it syncs with my gwern.net server.

Looking at the File.hs module which defines copyFileCompiler, it seems to be mostly wrappers around a single call to System.Directory's copyFileWithMetadata. Is there any reason a symbolic link version couldn't be defined by swapping that out for createFileLink, as in the patch below?

diff --git a/lib/Hakyll/Core/File.hs b/lib/Hakyll/Core/File.hs
index 49af659..6a5775e 100644
--- a/lib/Hakyll/Core/File.hs
+++ b/lib/Hakyll/Core/File.hs
@@ -8,6 +8,8 @@ module Hakyll.Core.File
     , copyFileCompiler
     , TmpFile (..)
     , newTmpFile
+    , SymlinkFile (..)
+    , symlinkFileCompiler
     ) where
 
 
@@ -20,6 +22,7 @@ import           System.Directory              (copyFileWithMetadata)
 import           System.Directory              (copyFile)
 #endif
 import           System.Directory              (doesFileExist,
+                                                createFileLink,
                                                 renameFile)
 import           System.FilePath               ((</>))
 import           System.Random                 (randomIO)
@@ -56,6 +59,19 @@ copyFileCompiler = do
     provider   <- compilerProvider <$> compilerAsk
     makeItem $ CopyFile $ resourceFilePath provider identifier
 
+--------------------------------------------------------------------------------
+-- | This will not copy a file but create a symlink, which can save space & time for static sites with many large static files which would normally be handled by copyFileCompiler. (Note: the user will need to make sure their sync method handles symbolic links correctly!)
+newtype SymlinkFile = SymlinkFile FilePath
+    deriving (Binary, Eq, Ord, Show, Typeable)
+--------------------------------------------------------------------------------
+instance Writable SymlinkFile where
+    write dst (Item _ (SymlinkFile src)) = createFileLink src dst
+--------------------------------------------------------------------------------
+symlinkFileCompiler :: Compiler (Item SymlinkFile)
+symlinkFileCompiler = do
+    identifier <- getUnderlying
+    provider   <- compilerProvider <$> compilerAsk
+    makeItem $ SymlinkFile $ resourceFilePath provider identifier

The one part that puzzles me is that createFileLink src dst creates self-links. I can try something like prepending the absolute path like ("/home/gwern/wiki/"++src) but I don't understand where the correct relative/absolute path prefix comes from since I thought src dst would look like docs/foo.pdf _site/docs/foo.pdf but that's obviously not how it works...

(While a hack, prepending does work: I go from a _site/ of 41GB to <0.2GB. A good 10 minutes faster too.)

gwern avatar Jul 22 '20 17:07 gwern

Any feedback on this? I'd particularly like this upstreamed because my attempts to define it inside my own hakyll.hs have foundered on type issues with the deriving Binary & Item; they work inside File.hs but not elsewhere, requiring me to keep a forked Hakyll installed. (At this point, I'm low enough on disk space that I wouldn't be able to compile gwern.net without this optimization.)

gwern avatar Oct 31 '20 20:10 gwern

> my attempts to define it inside my own hakyll.hs have foundered on type issues with the deriving Binary & Item

Please submit this as PR, I'll merge it.

Minoru avatar Nov 11 '20 22:11 Minoru

Thanks @gwern!

Minoru avatar Nov 12 '20 14:11 Minoru

So I happened to undo my local patch while doing a reinstall of my Pandoc toolchain to pull in a fix related to <figure> handling, and I think there was a misunderstanding here: my patch above is not correct. It results in symbolic self-links which are totally broken, eg

...
ls: cannot access '_site/Zeo.page': Too many levels of symbolic links
$ ls -l _site/*.page
lrwxrwxrwx 1 gwern gwern 32 Mar  9 21:01 _site/2012-election-predictions.page -> ./2012-election-predictions.page
...

That is what I was referring to in my discussion of hacking src to make it point to a correct filepath like _site/2012-election-predictions.page -> /home/gwern/wiki/2012-election-predictions.page. It needs some relatively small tweak, which I haven't been able to work out, to make it correct and point to ../.

I thought when you committed you'd fixed that, but trying just now it seems that is not the case?

gwern avatar Mar 10 '21 02:03 gwern

My bad! I somehow overlooked your warning about relative links when I suggested to merge this.

> I thought src dst would look like docs/foo.pdf _site/docs/foo.pdf but that's obviously not how it works...

From my reading of the code, that's exactly how it works. The problem is that a relative symlink is resolved relative to the directory in which it resides, so "./docs/foo.pdf", when resolved from inside "_site/docs/", points to "_site/docs/docs/foo.pdf".
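
To make the resolution rule concrete, here is a minimal sketch that computes, purely lexically, where a relative link target ends up pointing. The function name resolveRelativeTarget is hypothetical (it is not part of Hakyll or System.Directory), and the sketch assumes POSIX-style paths and a target without ".." components.

```haskell
import System.FilePath (normalise, takeDirectory, (</>))

-- | Hypothetical helper: given the path of a symlink and the relative target
-- stored in it, return the path the link effectively points to.
-- This is a lexical computation only; it never touches the file system and
-- does not collapse ".." components.
resolveRelativeTarget :: FilePath -> FilePath -> FilePath
resolveRelativeTarget linkPath target =
    -- A relative target is interpreted from the directory containing the link.
    normalise (takeDirectory linkPath </> target)
```

Under these assumptions, resolveRelativeTarget "_site/docs/foo.pdf" "./docs/foo.pdf" gives "_site/docs/docs/foo.pdf", and for a top-level file such as "_site/2012-election-predictions.page" with target "./2012-election-predictions.page" the result is the link's own path, which matches the self-links reported above.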

One way to fix it would be to use System.Directory.makeAbsolute in symlinkFileCompiler. But I don't like this, because then the _site directory can't be moved to another place without breaking the links.

The other option is to make src relative to dst, but I don't see a function in System.Directory that does this. The only candidate, System.FilePath.makeRelative, explains that it doesn't introduce .. into the paths, because one of the parent directories might be itself a symlink, and going up from it might lead us to a different place altogether.

We can write our own "relativization" function: 1) take destinationDirectory, replace all components with ..; 2) take the item route, drop the filename, replace directory components with ..; 3) concatenate (1), (2), and the route. This still suffers from the same problem that's outlined in the doc for makeRelative, but I think it's on the user if they copy something into a directory which is itself a symlink. (But I think this situation is impossible, because Hakyll executes rules in arbitrary order, and if the directory doesn't exist, it'll be created.)
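
A minimal sketch of such a relativization function, under some assumptions: src and dst are both relative to the same working directory, dst already includes the destination directory (as in "_site/docs/foo.pdf"), and neither contains ".." components. Because dst already carries the destination directory, steps (1) and (2) collapse into "one .. per directory component of dst". The name relativeTarget is hypothetical, and the caveat from the makeRelative documentation about symlinked parent directories still applies.

```haskell
import System.FilePath (joinPath, normalise, splitDirectories, takeDirectory, (</>))

-- | Hypothetical helper: compute a link target for @dst@ that points back to
-- @src@, where both paths are relative to the same working directory.
relativeTarget :: FilePath -> FilePath -> FilePath
relativeTarget src dst = joinPath ups </> normalise src
  where
    -- Directory components of the destination, e.g. ["_site", "docs"].
    dstDirs = filter (/= ".") (splitDirectories (takeDirectory (normalise dst)))
    -- One ".." per component climbs from the link's directory back to the
    -- working directory.
    ups = map (const "..") dstDirs
```

With these assumptions, relativeTarget "./docs/foo.pdf" "_site/docs/foo.pdf" yields "../../docs/foo.pdf", which could then be passed to createFileLink in place of src.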

Alternatively, use hard links. But that'll require separate code for *nix and Windows, I believe.
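
For the *nix half, a hard-link write could look roughly like the sketch below. It uses createLink from the unix package, so it will not build on Windows; the name hardLinkFile is hypothetical, and the sketch assumes source and destination live on the same file system (hard links cannot cross file systems).

```haskell
import Control.Monad (when)
import System.Directory (createDirectoryIfMissing, doesFileExist, removeFile)
import System.FilePath (takeDirectory)
import System.Posix.Files (createLink)

-- | Hypothetical helper: make @dst@ a hard link to @src@ instead of a copy.
-- POSIX only; both paths must be on the same file system.
hardLinkFile :: FilePath -> FilePath -> IO ()
hardLinkFile src dst = do
    -- Ensure the destination directory exists.
    createDirectoryIfMissing True (takeDirectory dst)
    -- createLink fails if the destination already exists, so remove a stale
    -- file from a previous build first.
    exists <- doesFileExist dst
    when exists (removeFile dst)
    createLink src dst
```

Since a hard link is just another name for the same file, no path relativization is needed, and the _site directory can be moved within the same file system without breaking anything.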

I don't have the energy to work on this myself. If you want to push this to completion, I'm open to further discussions, you can bounce ideas off me if you want. Otherwise I can just revert the current version, re-open this issue, and wait until someone gets motivated to finish this off.

Minoru avatar Mar 10 '21 15:03 Minoru

Okay, the patch is now reverted. Sorry for the mess I've caused here >_<

Let's wait until someone has energy to brush this up and submit a new one.

Minoru avatar Mar 14 '21 13:03 Minoru

If it's unclear which function to use, perhaps we can push the choice onto the user. Right now my hack is to add a /home/gwern/wiki/ prefix to make the symlink paths absolute (and then it rsyncs fine to the actual server). Perhaps the function can be parameterized to take such a prefix, defaulting to the current working directory? Then I'd write compile (symlinkFileCompiler Nothing) or, to be explicit, compile (symlinkFileCompiler $ Just "/home/gwern/wiki/").
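
As a rough illustration of that parameterization, here is a sketch of just the IO part, outside of Hakyll's Writable and Compiler machinery. The name symlinkWithBase is hypothetical; it assumes src is relative to the base directory and that any existing destination is a regular file or a working link that can simply be replaced.

```haskell
import Control.Monad (when)
import System.Directory
    ( createDirectoryIfMissing, createFileLink, doesFileExist
    , getCurrentDirectory, removeFile )
import System.FilePath (normalise, takeDirectory, (</>))

-- | Hypothetical helper: create @dst@ as a symlink to @src@, made absolute
-- with the given base directory (or the current directory for 'Nothing').
symlinkWithBase :: Maybe FilePath -> FilePath -> FilePath -> IO ()
symlinkWithBase mBase src dst = do
    base <- maybe getCurrentDirectory pure mBase
    createDirectoryIfMissing True (takeDirectory dst)
    -- createFileLink fails if the destination exists; drop a stale entry first.
    exists <- doesFileExist dst
    when exists (removeFile dst)
    -- An absolute target is immune to the "resolved relative to the link's
    -- directory" problem, at the cost of a non-relocatable _site/.
    createFileLink (normalise (base </> src)) dst
```

A call such as symlinkWithBase (Just "/home/gwern/wiki/") "./docs/foo.pdf" "_site/docs/foo.pdf" would then correspond to the explicit form above, and passing Nothing to the default.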

gwern avatar Mar 19 '21 20:03 gwern

Sorry for such a delay replying, I got buried under some life stuff.

Upon re-reading the thread, I think the easiest way forward is to use hard links, and implement them just for the OS that you, @gwern, are using. If somebody needs it on a different OS, they can submit a patch later. If somebody absolutely needs symbolic links (e.g. because their destination directory is on a different disk), they can re-visit this issue and see what they can come up with. What do you think of that?

In case you're against that, I'll also comment on parameterising symlinkFileCompiler: I think it's better to have a separate function for this, like symlinkFileCompilerWithBasePath or something. Once the path-relativization kinks are worked out, we can provide a shorter symlinkFileCompiler that doesn't need a path.

Minoru avatar Mar 28 '21 13:03 Minoru

I have not tried using hardlinks before, but I'm willing to give it a try.

gwern avatar Mar 28 '21 13:03 gwern

Any update on this? Was there any hardlink patch I was supposed to test?

gwern avatar May 07 '24 16:05 gwern

Not from me; I didn't find the energy to write the hardlinking patch yet.

Minoru avatar May 08 '24 19:05 Minoru