funflow listDirContents defeats caching?


The first step of my workflow produces a directory of approximately 500 job specifications.

The next step then splits this directory up using listDirContents. The result is a list of Content File.

My expectation was that if any of these Content File values were already in the store, they would not trigger recompilation in the next step. In particular, the next step fetches something from a website which will never change, so I don't want it repeated multiple times.

What actually seems to happen is that if any one of these files changes, all the jobs are executed again. Only after the resource has been fetched does the pipeline stop, as funflow realises that the output of the script already exists in the store.

This analysis might be wrong, but in any case a lot more recompilation is happening than I expect, and I'm finding it hard to debug why.

Sep 10 '18 15:09 mpickering

I fixed this by defining my own combinator which properly splits up a directory and puts all the files into their own store locations.

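-- Split a directory into its files, giving each file its own store
-- location so that it no longer points into the original directory.
splitDir :: ArrowFlow eff ex arr => arr (Content Dir) ([Content File])
splitDir = proc dir -> do
  (_, fs) <- listDirContents -< dir
  mapA reifyFile -< fs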


-- Put a file, which might be a pointer into a dir, into its own store
-- location.
reifyFile :: ArrowFlow eff ex arr => arr (Content File) (Content File)
reifyFile = proc f -> do
  file <- getFromStore return -< f
  putInStoreAt (\d fn -> copyFile fn d) -< (file, CS.contentFilename f)

Sep 10 '18 18:09 mpickering

A similar thing happens with copyDirToStore: if any file in the directory changes, then any script which depends on a file in the directory you copied into the store is recompiled.

Sep 10 '18 18:09 mpickering

I've just read your blog post, which was great! Seems like most of the problems you're seeing are reflections of the fact that funflow's intentionality is all done at the content store item level, rather than the file level. There is a reason for this; it makes it easy for external steps to work with the content store, because they just get given the directory and can work with it directly.
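To make the item-level behaviour concrete, here is a minimal toy model in plain Haskell. It does not use funflow at all: `toyHash`, `Item`, `itemHash` and `fileKey` are hypothetical names invented for this sketch, and the FNV-1a hash only stands in for a real content hash. The assumption being illustrated is that a file is identified by the identity of its containing item plus a relative path.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl', sortOn)
import Data.Word (Word64)

-- Toy FNV-1a hash; a stand-in for a real cryptographic content hash.
toyHash :: String -> Word64
toyHash = foldl' step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- A store item modelled as a directory: (relative path, file contents).
type Item = [(FilePath, String)]

-- Identity of the whole item: a hash over every path and every content.
itemHash :: Item -> Word64
itemHash item = toyHash (concatMap render (sortOn fst item))
  where
    render (p, c) = p ++ "\0" ++ c ++ "\0"

-- A file seen as a pointer into an item: (item identity, relative path).
-- In this model, a downstream step is cached under this key.
fileKey :: Item -> FilePath -> (Word64, FilePath)
fileKey item path = (itemHash item, path)

main :: IO ()
main = do
  let before = [("job1.json", "fetch A"), ("job2.json", "fetch B")]
      after  = [("job1.json", "fetch A"), ("job2.json", "fetch B, edited")]
  -- job1.json is byte-for-byte identical in both items; this compares
  -- the keys a downstream step would see for it.
  print (fileKey before "job1.json" == fileKey after "job1.json")
```

Under this model the two keys for job1.json are expected to differ even though the file itself is unchanged, which matches the recompilation described in the question.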

My instinctive thought was that this would be a tricky thing to change, but on consideration we do have a potential way around this. When a step is completed, we move it from its pending location to a completed path determined by its contents. We could augment this process by first moving each individual file in the store to a content-addressed location, and then building the "completed" path based upon that. We could then change the hash for a file item inside the store to be determined only by its own contents.
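The two-phase idea can be sketched as follows. This is only an illustration under stated assumptions, not funflow's implementation: `toyHash`, `Pending`, `fileHashes` and `completedHash` are hypothetical names, and the store is modelled as an in-memory list rather than paths on disk.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl', sortOn)
import Data.Word (Word64)

-- Toy FNV-1a hash; a stand-in for a real cryptographic content hash.
toyHash :: String -> Word64
toyHash = foldl' step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- Output of a finished step, still in its pending location:
-- (relative path, file contents).
type Pending = [(FilePath, String)]

-- Phase 1: address every file by its own contents only.
fileHashes :: Pending -> [(FilePath, Word64)]
fileHashes pending = [ (p, toyHash c) | (p, c) <- sortOn fst pending ]

-- Phase 2: derive the "completed" identity of the whole item from the
-- listing of (path, file hash) pairs produced in phase 1.
completedHash :: Pending -> Word64
completedHash pending = toyHash (concatMap render (fileHashes pending))
  where
    render (p, h) = p ++ "\0" ++ show h ++ "\0"

main :: IO ()
main = do
  let before = [("job1.json", "fetch A"), ("job2.json", "fetch B")]
      after  = [("job1.json", "fetch A"), ("job2.json", "fetch B, edited")]
  -- The unchanged file keeps its per-file hash across both versions.
  print (lookup "job1.json" (fileHashes before)
           == lookup "job1.json" (fileHashes after))
  -- The item as a whole is compared separately; it contains an edited
  -- file, so its identity is expected to change.
  print (completedHash before == completedHash after)
```

The design point is that a step consuming only job1.json would be keyed on the per-file hash, so an edit to job2.json would no longer invalidate it, while steps consuming the whole directory still see a new identity.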

Sep 13 '18 15:09 nc6

Another analysis is that listDirContents should be implemented like my splitDir combinator.

If you start with a Content Dir but then change it into a [Content File] then it feels to me that you shouldn't be able to inspect any of the Content File to work out which directory they came from. This seems to break abstraction.

There seem to be a number of ways currently to turn a Content File back into a Content Dir (without making a new store location), but it would be more consistent if each Content t lived in its own store location.

We could then also state some invariants, such as: each unique file exists only once in the store, so if two steps produce an identical file, the two outputs will always be identified with each other. I think this is what you are suggesting in your second paragraph.
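A small sketch of that invariant, assuming a store keyed purely by file content. `Store`, `putFile` and `toyHash` are hypothetical names for this illustration and are not part of funflow; the in-memory map stands in for store paths on disk.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import qualified Data.Map.Strict as Map
import Data.Word (Word64)

-- Toy FNV-1a hash; a stand-in for a real cryptographic content hash.
toyHash :: String -> Word64
toyHash = foldl' step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- The store maps a content hash to the file contents stored under it.
type Store = Map.Map Word64 String

-- Put a file into the store under the hash of its own contents and
-- return that hash together with the updated store.
putFile :: String -> Store -> (Word64, Store)
putFile content store = (key, Map.insert key content store)
  where
    key = toyHash content

main :: IO ()
main = do
  let (k1, s1) = putFile "fetched resource" Map.empty  -- output of step A
      (k2, s2) = putFile "fetched resource" s1         -- identical output of step B
  print (k1 == k2)     -- True: both steps refer to the same store entry
  print (Map.size s2)  -- 1: the file exists only once in the store
```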

There is a potential problem with this. I noticed several places in my scripts where, by coincidence, a Content File was in the same store directory as some other scripts, so that it worked and depended on these scripts without that dependency being made explicit in the flow. The current model also makes it easier to get recompilation wrong (the copyDirToStore example above). The solution was to make the environment explicit by building a Content Dir that merges together some Content File values. Perhaps there should be a third content type, Content (Env File), which is like a File but is actually a pointer into a Content Dir. This type should only be used to pass scripts to external processes.

Do you have some use cases in mind that placing each Content File in its own store path would break? It would be a bit more inconvenient if things worked like this, as there would have to be a flow to access a file in a directory, but it would be much easier to avoid getting the program wrong.

Sep 14 '18 11:09 mpickering
