dendrite
Add an option to read the first N lines from a file
Reading only the first N lines of a file would be a useful option during development, when you want to test your code against a handful of lines from a very large file.
You can approximate this behavior with a reducer that exits early:
(let [lines-read (atom 0)
      wrapped-reducer (fn [acc v]
                        (if (>= @lines-read 10)
                          (do
                            (println "10 lines have been read")
                            (reduced acc))
                          (do (swap! lines-read inc)
                              (original-reducer acc v))))]
  ..)
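For anyone who wants to try the pattern outside of dendrite, here is a minimal self-contained sketch. `limit-reducer` is a hypothetical helper name (it is not part of dendrite), and a plain `range` stands in for the records that would come from a file. Unlike the snippet above, it stops as soon as the Nth value has been consumed rather than on the value after it, and it assumes `n` is at least 1.

```clojure
;; Hypothetical helper, not part of dendrite: wraps a reducing function `rf`
;; so that the reduction stops once `n` values have been consumed (n >= 1).
(defn limit-reducer
  [n rf]
  (let [seen (atom 0)]
    (fn [acc v]
      (let [result (rf acc v)]
        (if (>= (swap! seen inc) n)
          ;; ensure-reduced avoids double-wrapping if rf already returned
          ;; a reduced value.
          (ensure-reduced result)
          result)))))

;; Stand-in data: a plain range instead of records read from a file.
(def first-ten
  (reduce (limit-reducer 10 conj) [] (range 1000000)))
```

Because the counter lives in an atom, each wrapped reducer is single-use: build a fresh one with `limit-reducer` for every reduction.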
However, even with the early-exit reducer, the read still takes a few seconds. From @jwhitbeck:
Indeed, dendrite is currently optimized for throughput and is reading far ahead of those ten lines. The read process has two stages: (1) deserialize each active column into arrays of values, and (2) assemble the nested records from the flattened columnar layout. Your wrapped-reducer short-circuits (2) but doesn't impact (1).
Adding a "line-count" option is definitely the way to go, but isn't easy given the current implementation. More generally, I have a design for adding indexing/filtering capabilities and buffering less when it isn't needed which should make dendrite much snappier for interactive use.
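The two-stage explanation can be illustrated with a toy model. This is only a sketch under stated assumptions: `decode-column`, `assemble-records`, and `read-first-n` are made-up stand-ins, not dendrite's internals, and the "columns" are plain in-memory sequences. It shows that an eager first stage does all of its work even when the reduction over the assembled records stops after ten.

```clojure
;; Toy model only: illustrative stand-ins, not dendrite's implementation.
(def decoded-count (atom 0))

(defn decode-column
  "Stage 1 stand-in: eagerly 'deserializes' every value in a column."
  [raw-values]
  (mapv (fn [v] (swap! decoded-count inc) v) raw-values))

(defn assemble-records
  "Stage 2 stand-in: lazily zips decoded columns into records."
  [ids labels]
  (map (fn [id label] {:id id :name label}) ids labels))

(defn read-first-n
  "Decodes both columns in full, then reduces over the assembled records,
  exiting early once n records have been collected."
  [n raw-ids raw-labels]
  (let [ids    (decode-column raw-ids)
        labels (decode-column raw-labels)]
    (reduce (fn [acc record]
              (let [acc (conj acc record)]
                (if (>= (count acc) n) (reduced acc) acc)))
            []
            (assemble-records ids labels))))

(def records
  (read-first-n 10 (range 100000) (map str (range 100000))))

;; (count records) is 10, yet @decoded-count is 200000:
;; the early exit skipped record assembly, not column decoding.
```

In this model, making the read cheap for small N would mean bounding the work done in stage 1, which is what a "line-count" option would need to achieve.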