
iterate-dir throws OOME on large directory structures

Open pmonks opened this issue 12 years ago • 3 comments

The iterate-dir function consumes all available heap and throws an OOME on large directory structures.

The following typescript demonstrates this problem in a couple of different ways when the function is presented with a directory containing approximately 600,000 files & sub-directories (note: embedded ANSI escape characters have been manually removed from this typescript for clarity):

Script started on Thu Jan  3 22:45:39 2013
bash-3.2$ ls -R /Users/pmonks/Development | wc -l
ls: unreadableDirectory: Permission denied
  614630
bash-3.2$ lein repl
nREPL server started on port 52181
REPL-y 0.1.0-beta10
Clojure 1.4.0
    Exit: Control+D or (exit) or (quit)
Commands: (user/help)
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
          (user/sourcery function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
Examples from clojuredocs.org: [clojuredocs or cdoc]
          (user/clojuredocs name-here)
          (user/clojuredocs "ns-here" "name-here")
fs-scan.core=> (require '[fs.core :as fs])
nil
fs-scan.core=> (defn walker [root dirs files] ())
#'fs-scan.core/walker
fs-scan.core=> (fs/walk walker "/Users/pmonks/Development")
OutOfMemoryError Java heap space  java.util.Arrays.copyOf (Arrays.java:2882)

fs-scan.core=> (fs/iterate-dir "/Users/pmonks/Development")
OutOfMemoryError Java heap space  java.util.Arrays.copyOf (Arrays.java:2882)

fs-scan.core=> (do (fs/iterate-dir "/Users/pmonks/Development") ())
OutOfMemoryError Java heap space  java.util.Arrays.copyOf (Arrays.java:2882)

fs-scan.core=> exit
Bye for now!

bash-3.2$ exit
exit

Script done on Thu Jan  3 22:53:42 2013

I believe this is occurring because iterate-dir is not lazy (despite the doc comment), and is eagerly building the entire sequence of pathnames in memory.

pmonks, Jan 04 '13 06:01

For my use case, this issue appears when using the walk function. Basically I want to be able to walk very large directory structures (10s to 100s of millions of files, transitively), processing as I go.

pmonks, Jan 04 '13 07:01

I see. The problem is that the zipper used under the hood holds the whole tree in memory. I'll get a fix in asap. Should just be a tree-seq (I didn't write this code. I never write code that blows the heap, you see ;)).
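For illustration only, here is a minimal sketch of what a tree-seq based walk could look like. It is not the fix that went into fs; `lazy-files` is a hypothetical name, and the sketch assumes plain `java.io.File` access with no symlink-cycle protection.

```clojure
(import '(java.io File))

(defn lazy-files
  "Hypothetical helper, not part of fs: returns a lazy, depth-first seq of
  java.io.File objects rooted at `root` (a path string or a File)."
  [root]
  (tree-seq
    ;; branch?: only directories have children
    (fn [^File f] (.isDirectory f))
    ;; children: listFiles returns null for unreadable directories,
    ;; and (seq nil) is nil, so those are treated as empty
    (fn [^File d] (seq (.listFiles d)))
    (File. (str root))))

(comment
  ;; doseq consumes the seq one element at a time and does not retain
  ;; the head, so already-visited entries can be garbage collected.
  (doseq [^File f (lazy-files "/Users/pmonks/Development")]
    (println (.getPath f))))
```

With this shape, only the listings of the directories on the current path need to stay in memory at once, rather than the whole tree. Binding the result to a var or local that outlives the traversal would hold the head and bring the memory problem back.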

Raynes, Jan 04 '13 07:01

;-)

Thanks for the lickety-split response - I'll keep an eye out for the update and give the new version a whirl.

pmonks, Jan 04 '13 07:01