grafter
Miscellaneous new dataset functions
@Robsteranium I saw these in one of your pipelines...
We should consider adding some new dataset functions, or supporting their use cases somehow:
;; Remove the named columns, keeping all others.
(defn drop-columns [dataset cols-to-drop]
  (let [selected-columns (remove (set cols-to-drop) (column-names dataset))]
    (columns dataset selected-columns)))

;; Keep only the columns whose names satisfy selector-fn.
(defn select-columns [dataset selector-fn]
  (let [selected-columns (filter selector-fn (column-names dataset))]
    (if (empty? selected-columns)
      (throw (RuntimeException. "No columns selected"))
      (columns dataset selected-columns))))

;; Remove rows where pred is true for the column (or for any of several columns).
(defn drop-where [dataset pred column-or-columns]
  (if (sequential? column-or-columns)
    (letfn [(drop-where-reverse [pred column dataset] (drop-where dataset pred column))]
      ((apply comp (for [column column-or-columns]
                     (partial drop-where-reverse pred column))) dataset))
    (transform-rows dataset (fn [rows] (remove #(pred (% column-or-columns)) rows)))))

;; Keep only rows where pred is true for the given column.
(defn take-where [dataset pred column]
  (transform-rows dataset (fn [rows] (filter #(pred (% column)) rows))))

;; Replace each row with the zero or more rows returned by f.
(defn mapcat-rows [dataset f]
  (transform-rows dataset (fn [rows] (mapcat f rows))))
It would also be good to get a more complete list of missing features and functionality. Our current approach has been pretty ad hoc, and I think it would be good to assemble a list of functions we've used, or might want to use, categorise them, and then see if there are any generalities or further holes we can plug.
Let's consider this ticket a dumping ground to assemble requirements, perhaps for the decomplecting selection work.
Other ones I can think of...
;; deriving multiple columns from one or more inputs...
(derive-columns ds [:firstname :surname] :name split)
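To make the idea concrete, here is a minimal sketch of what derive-columns might look like. It is only an illustration under stated assumptions: the dataset is modelled as a plain map of :column-names and :rows rather than grafter's own dataset type, and the argument order (new columns, then source columns, then function) is a guess from the call above, not a settled signature.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: a dataset is a plain map of :column-names and
;; :rows (a seq of maps). This is NOT grafter's dataset type.
(defn derive-columns
  "Applies f to the values of from-cols in each row and binds the
  sequence of values it returns to new-cols, in order."
  [ds new-cols from-cols f]
  (let [from-cols (if (sequential? from-cols) from-cols [from-cols])]
    (-> ds
        (update :column-names into new-cols)
        (update :rows
                (fn [rows]
                  ;; map keeps the row sequence lazy
                  (map (fn [row]
                         (merge row (zipmap new-cols
                                            (apply f (map row from-cols)))))
                       rows))))))

;; Illustrative data, invented for this example.
(def people
  {:column-names [:name]
   :rows [{:name "Ada Lovelace"} {:name "Alan Turing"}]})

(def split-people
  (derive-columns people [:firstname :surname] :name #(str/split % #" ")))
```

Because f returns a sequence, the same shape covers one-to-many, many-to-one and many-to-many derivations.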
conj-cols and conj-rows in incanter (are they lazy?)
grep is now a more general rows, which makes me think we should also support grep-columns, or a better way of selection that lets you grep in either dimension, or both at the same time, with all current parameterisations of grep supported in both dimensions. See also incanter's sel.
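As a rough sketch of selecting in both dimensions at once, something like the following might work. The name select-where and its options map are hypothetical, and the dataset is again a plain map of :column-names and :rows rather than grafter's type; it only takes predicates, so the other parameterisations of grep are not covered.

```clojure
;; Hypothetical sketch on plain maps, not grafter's API.
(defn select-where
  "Keeps rows matching the :rows predicate and columns whose names
  match the :cols predicate. Either option may be omitted."
  [ds {:keys [rows cols]
       :or {rows (constantly true) cols (constantly true)}}]
  (let [kept (filterv cols (:column-names ds))]
    {:column-names kept
     :rows (->> (:rows ds)
                (filter rows)                  ; row dimension
                (map #(select-keys % kept)))})) ; column dimension

;; Illustrative data, invented for this example.
(def populations
  {:column-names [:area :pop-2010 :pop-2011]
   :rows [{:area "Leeds" :pop-2010 1 :pop-2011 2}
          {:area "York"  :pop-2010 3 :pop-2011 4}]})

(def york-pops
  (select-where populations
                {:rows #(= "York" (:area %))
                 :cols #(re-find #"^pop" (name %))}))
```

Both filter and map are lazy here, so the rows are only realised as they are consumed.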
Not a dataset -> dataset function, but useful for building them, is the dataset -> seq function:
(defn column->seq [ds column]
  (->> ds
       :rows
       (map #(get % column))))
This gets a column as a sequence. Something like this exists in incanter, but it might eagerly consume the sequence into RAM.
It will need to be generalised to resolve column ids the way incanter does.
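One possible shape for that generalisation is sketched below. It assumes a column id is either a zero-based integer position or the column's own name, which is a simplification of whatever resolution rules we end up with; resolve-column is a hypothetical helper and the dataset is a plain map of :column-names and :rows.

```clojure
;; Hypothetical helper: turns a column id into a column name.
;; Assumes ids are either integer positions or the name itself.
(defn resolve-column [ds column-id]
  (if (integer? column-id)
    (nth (:column-names ds) column-id)
    column-id))

(defn column->seq [ds column-id]
  (let [column (resolve-column ds column-id)]
    ;; map is lazy, so only the rows actually consumed are realised
    (map #(get % column) (:rows ds))))

;; An unbounded dataset, to show that nothing is consumed eagerly.
(def endless
  {:column-names [:n :double]
   :rows (map (fn [i] {:n i :double (* 2 i)}) (range))})

(def first-doubles (take 3 (column->seq endless 1)))
```

Taking a few values from an infinite row sequence only terminates because the column is never fully realised.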
It might be useful for build-lookup-table to accept a default value to return when :not-found. It may also be useful for this to be a function, e.g. to return another column containing the status.
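A minimal sketch of the default-value idea follows. The function build-lookup is a stand-in written for this example and does not reflect build-lookup-table's real signature; it takes a seq of row maps, and the not-found argument may be either a plain value or a function of the missing key.

```clojure
;; Hypothetical stand-in for build-lookup-table, on plain row maps.
(defn build-lookup
  "Returns a function from key to value. When the key is absent,
  returns not-found, or (not-found key) if not-found is a function."
  [rows key-col value-col not-found]
  ;; note: the table itself is built eagerly from rows
  (let [table (into {} (map (juxt key-col value-col)) rows)]
    (fn [k]
      (if (contains? table k)
        (get table k)
        (if (fn? not-found) (not-found k) not-found)))))

;; Illustrative data, invented for this example.
(def codes [{:code "A" :label "Alpha"} {:code "B" :label "Beta"}])

(def lookup-with-default (build-lookup codes :code :label "unknown"))
(def lookup-with-fn
  (build-lookup codes :code :label (fn [k] (str "missing: " k))))
```

Using contains? rather than a nil check means a key that maps to nil still counts as found.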