WIP: add convenience functions for selecting and sorting
Hi @sbelak,
Thanks for this library! I've been getting a lot of mileage out of it. Here are some functions that I use that might be a nice addition to your library:
- select-cols-regex: selects columns using a regular expression.
- compare-by: returns a comparator that makes it easy to sort ascending or descending per keyword. Unlike the default comparator, it always sorts nil values last.
Cheers!
I've added the function filter-cols. Say I want to filter columns based on some predicate like a regular expression, their membership of some collection of keywords or some composition of these predicates. In this case it would be nice to have a function like filter-cols as it simplifies the code as follows:
From
(->> df
;; change collection of columns in some way, e.g. using (derive-cols ...)
(#(select-cols (filter pred' (cols %)) % )))
to
(->> df
;; change collection of columns in some way, e.g. using (derive-cols ...)
(filter-cols pred'))
I've reimplemented select-cols-regex using this new function.
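For readers following along, here is a minimal sketch of what filter-cols could look like. It assumes a dataframe is a sequence of maps, and the definitions of cols and select-cols below are hypothetical stand-ins written for this example, not the library's actual code.

```clojure
;; Hypothetical stand-ins; the library's own cols/select-cols may differ.
(defn cols [df]
  (keys (first df)))

(defn select-cols [ks df]
  (map #(select-keys % ks) df))

;; Keep only the columns whose name satisfies pred.
(defn filter-cols [pred df]
  (select-cols (filter pred (cols df)) df))

(def example-df [{:a_1 1 :a_2 2 :b 3}
                 {:a_1 4 :a_2 5 :b 6}])

;; A regex-based selection then becomes a one-liner on top of filter-cols.
(filter-cols #(re-find #"^a_" (name %)) example-df)
;; => ({:a_1 1, :a_2 2} {:a_1 4, :a_2 5})
```

Under these assumptions, select-cols-regex reduces to filter-cols with a re-find predicate on the column name.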
Thanks for this. I really like it. A couple of things:
- I'm not 100% on select-cols-regex. Feels very specific (I'm guessing this is for messy data where you have col_name_1...n) and minimal convenience from the more generic filter-cols.
- I think select-cols-by better reflects semantics, as the fn operates on col names rather than values like the filter family does.
- I love the utility of compare-by, not entirely sold on the signature (but it might be correct). Two things feel like warts: mandatory :asc/:desc for each comparator and the fact we need these "magic" tokens. Have you considered using a combinator that flips 1 <-> -1 instead? So you'd write something like
(sort (compare-by :a (desc :b)) ...)
- Using partition will probably yield cleaner code than loop.
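One way the combinator idea could be sketched is shown below. Everything here is an assumption for illustration: the helper nil-last-compare, the metadata marker, and the rule that a bare keyfn means ascending are not taken from the PR or the library.

```clojure
;; Compare two values, always placing nil after non-nil values.
(defn- nil-last-compare [a b]
  (cond
    (and (nil? a) (nil? b)) 0
    (nil? a) 1
    (nil? b) -1
    :else (compare a b)))

;; Ascending comparator on (keyfn row).
(defn asc [keyfn]
  (fn [x y] (nil-last-compare (keyfn x) (keyfn y))))

;; Descending comparator: flips the result for non-nil values only,
;; so nil still sorts last. Tagged with metadata so compare-by can
;; tell it apart from a plain keyfn.
(defn desc [keyfn]
  (with-meta
    (fn [x y]
      (let [a (keyfn x) b (keyfn y)]
        (if (or (nil? a) (nil? b))
          (nil-last-compare a b)
          (compare b a))))
    {::comparator true}))

;; Each spec is either a keyfn (ascending by default) or a tagged comparator.
;; The first non-zero result decides the order.
(defn compare-by [& specs]
  (let [cmps (map #(if (::comparator (meta %)) % (asc %)) specs)]
    (fn [x y]
      (or (first (remove zero? (map #(% x y) cmps))) 0))))

(sort (compare-by :a (desc :b))
      [{:a 1 :b 1} {:a 1 :b 2} {:a nil :b 5} {:a 0 :b nil}])
;; => ({:a 0, :b nil} {:a 1, :b 2} {:a 1, :b 1} {:a nil, :b 5})
```

With this shape, ascending needs no token at all and desc is an ordinary function rather than a magic keyword.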
Thanks for your feedback :)
- I'll remove select-cols-regex.
- I'll think about how to make compare-by a bit cleaner.
Question:
Any reason you defined the function select-cols instead of using clojure.set/project? Is it because you don't want the return type to be a set?
3 reasons. In order of importance:
- sets break the ordering of the data
- project can only be used with keywords, while select-cols works with any keyfn
- while the set functions currently work on non-sets that's not a guarantee
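The first reason can be seen in a small REPL sketch. The rows data is made up for illustration, and the map/select-keys form is only a stand-in for a sequence-preserving column selection, not the library's select-cols.

```clojure
(require '[clojure.set :as set])

(def rows [{:a 2 :b 1} {:a 1 :b 2} {:a 2 :b 3}])

;; project returns a set, so row order is not preserved
;; and duplicate rows collapse into one.
(set/project rows [:a])
;; => a set holding {:a 2} and {:a 1}

;; A plain map over the rows keeps both order and duplicates.
(map #(select-keys % [:a]) rows)
;; => ({:a 2} {:a 1} {:a 2})
```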
That makes sense.
I've added the function derive-cols* to convey how I'd like the derive-cols function to behave. I don't propose to include it as is.
The benefit of derive-cols* over the current derive-cols is that it respects the order of new-cols, so a newly derived column can serve as the input of the next one. As a result, you can write
(->> [{:a 1 :b 2}{:a 3 :b 10}]
(derive-cols* (ordered-map :c [inc :b]
:d [inc :c])))
;; => ({:a 1, :b 2, :c 3, :d 4} {:a 3, :b 10, :c 11, :d 12})
or
(->> [{:a 1 :b 2}{:a 3 :b 10}]
(derive-cols* [:c [inc :b]
:d [inc :c]]))
instead of
(->> [{:a 1 :b 2}{:a 3 :b 10}]
(derive-cols {:c [inc :b]})
(derive-cols {:d [inc :c]}))
which becomes a bother when you have a long chain of new column derivations that have dependencies on each other.
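A minimal sketch of the ordered behaviour described above, assuming a dataframe is a sequence of maps and handling only the flat vector form where each column name is followed by [f & input-cols]. This is an illustration of the idea, not the implementation in the PR, and it ignores any other spec shapes the library's derive-cols may accept.

```clojure
;; Sketch only: derive columns one at a time, in the order given,
;; so each new column can read columns derived before it.
(defn derive-cols* [new-cols df]
  (map (fn [row]
         (reduce (fn [acc [col [f & in-cols]]]
                   ;; look up inputs in the accumulated row, not the original
                   (assoc acc col (apply f (map acc in-cols))))
                 row
                 (partition 2 new-cols)))
       df))

(->> [{:a 1 :b 2} {:a 3 :b 10}]
     (derive-cols* [:c [inc :b]
                    :d [inc :c]]))
;; => ({:a 1, :b 2, :c 3, :d 4} {:a 3, :b 10, :c 11, :d 12})
```

The key design point is that reduce threads the updated row through each step, which is what lets :d depend on :c.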
@sbelak What do you think?
I don't know much about clojure.spec yet. I'll make an attempt to implement derive-cols*, select-cols-by and compare-by in a more coherent fashion with respect to the rest of the lib.