Miscellaneous new dataset functions

Open RickMoynihan opened this issue 8 years ago • 5 comments

@Robsteranium I saw these in one of your pipelines...

We should consider adding some new dataset functions, or supporting their use cases somehow:

(defn drop-columns [dataset cols-to-drop]
  (let [selected-columns (remove (set cols-to-drop) (column-names dataset))]
    (columns dataset selected-columns)))

(defn select-columns [dataset selector-fn]
  (let [selected-columns (filter selector-fn (column-names dataset))]
    (if (empty? selected-columns)
      (throw (RuntimeException. "No columns selected"))
      (columns dataset selected-columns))))

(defn drop-where [dataset pred column-or-columns]
  (if (sequential? column-or-columns)
    (letfn [(drop-where-reverse [pred column dataset] (drop-where dataset pred column))]
      ((apply comp (for [column column-or-columns]
                     (partial drop-where-reverse pred column))) dataset))
    (transform-rows dataset (fn [rows] (remove #(pred (% column-or-columns)) rows)))))

(defn take-where [dataset pred column]
  (transform-rows dataset (fn [rows] (filter #(pred (% column)) rows))))

(defn mapcat-rows [dataset f]
  (transform-rows dataset (fn [rows] (mapcat f rows))))

RickMoynihan avatar Jul 06 '15 17:07 RickMoynihan

It would also be good to get a more complete list of missing features and functionality. Our current approach has been pretty ad hoc, and I think it would be good to assemble a list of functions we've used, or might want to use, categorise them, and then see if there are any generalities or further holes we can plug.

Let's consider this ticket a dumping ground to assemble requirements, perhaps for the decomplecting selection work.

Other ones I can think of...

;; deriving multiple columns from one or more inputs...
(derive-columns ds [:firstname :surname] :name split)
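To make the call above concrete, here is a rough sketch of what derive-columns could look like. It is only an illustration: the dataset is a plain map with :column-names and :rows (a stand-in, not grafter's real dataset type), and the argument order is simply read off the example call.

```clojure
(require '[clojure.string :as str])

;; Stand-in dataset: a plain map, NOT grafter's actual dataset type.
(def ds {:column-names [:name]
         :rows [{:name "Ada Lovelace"} {:name "Alan Turing"}]})

(defn derive-columns
  "Hypothetical sketch. Applies f to the values of from-cols in each row
  and binds the sequence it returns, positionally, to new-cols."
  [dataset new-cols from-cols f]
  (let [from-cols (if (sequential? from-cols) from-cols [from-cols])]
    (-> dataset
        ;; append the derived column names after the existing ones
        (update :column-names into new-cols)
        ;; map is lazy, so rows are only derived as they are consumed
        (update :rows
                (fn [rows]
                  (map (fn [row]
                         (merge row
                                (zipmap new-cols
                                        (apply f (map row from-cols)))))
                       rows))))))

;; One input column, two derived columns:
(derive-columns ds [:firstname :surname] :name #(str/split % #" "))
```

If f returns fewer values than there are new-cols, zipmap silently leaves the remaining columns out of that row, so a real version would probably want to validate that.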

RickMoynihan avatar Jul 06 '15 17:07 RickMoynihan

conj-cols and conj-rows in incanter (are they lazy?)

RickMoynihan avatar Jul 07 '15 16:07 RickMoynihan

grep is now a more general form of rows, which makes me think we should also support grep-columns, or a better way of selecting that lets you grep in either dimension, or both at the same time, with all current parameterisations of grep supported in both dimensions. See also incanter's sel.
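As a starting point for discussion, a column-wise grep might look something like the sketch below. Everything here is an assumption: grep-columns and matcher are hypothetical names, the dataset is a plain map with :column-names and :rows rather than grafter's real type, and column names are assumed to be keywords or strings. It accepts a string, a regex, or a predicate, which may or may not cover every parameterisation the row-wise grep supports.

```clojure
(require '[clojure.string :as str])

(defn- matcher
  "Turns a string, regex or function into a predicate on strings."
  [m]
  (cond
    (string? m) #(str/includes? % m)
    (instance? java.util.regex.Pattern m) #(boolean (re-find m %))
    :else m))

(defn grep-columns
  "Hypothetical sketch. Keeps only the columns whose name matches m."
  [dataset m]
  (let [keep? (comp (matcher m) name)   ; match against the column's name
        cols  (filterv keep? (:column-names dataset))]
    (-> dataset
        (assoc :column-names cols)
        ;; narrow each row lazily to the surviving columns
        (update :rows (fn [rows] (map #(select-keys % cols) rows))))))

;; Example with a stand-in dataset:
(def areas {:column-names [:area-code :area-name :population]
            :rows [{:area-code "E1" :area-name "X" :population 10}]})

(grep-columns areas #"^area")
```

Grepping in both dimensions at once could then be a composition of this with the existing row-wise grep, though a single selection function in the style of incanter's sel may turn out to be cleaner.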

RickMoynihan avatar Jul 09 '15 10:07 RickMoynihan

Not a dataset -> dataset function but useful for building them is the dataset -> seq function:

(defn column->seq [ds column]
  (->> ds
       :rows
       (map #(get % column))))

This gets a column as a sequence. It will need to be generalised to resolve column-ids the way incanter does. Something like this already exists in incanter, but it might eagerly consume the sequence into RAM.

RickMoynihan avatar Aug 12 '15 12:08 RickMoynihan

It might be useful for build-lookup-table to accept a default value to return when :not-found. It may also be useful for this to be a function, to e.g. return another column containing the status.
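A minimal sketch of the idea, to show the two cases side by side. This is not build-lookup-table's real signature: lookup-table-with-default is a hypothetical helper, and the dataset is a plain map with :rows standing in for a grafter dataset.

```clojure
(defn lookup-table-with-default
  "Hypothetical sketch. Returns a function from a key-col value to the
  matching value-col value. On a miss it returns not-found, or calls
  not-found with the key when not-found is a function."
  [dataset key-col value-col not-found]
  (let [table (into {}
                    (map (fn [row] [(get row key-col) (get row value-col)])
                         (:rows dataset)))]
    (fn [k]
      (cond
        (contains? table k) (get table k)
        ;; fn? is false for keywords, so :unknown is treated as a value
        (fn? not-found)     (not-found k)
        :else               not-found))))

;; Example with a stand-in dataset:
(def codes {:column-names [:code :label]
            :rows [{:code "E1" :label "England"}]})

(def label-for (lookup-table-with-default codes :code :label :unknown))
(label-for "E1")
(label-for "ZZ")
```

Using contains? rather than a plain get means a key that is present with a nil value is still treated as a hit, which seems like the less surprising behaviour for a status column.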

RickMoynihan avatar Aug 12 '15 13:08 RickMoynihan