
WIP: add convenience functions for selecting and sorting

Open keesterbrugge opened this issue 6 years ago • 5 comments

Hi @sbelak,

Thanks for this library! I've been getting a lot of mileage out of it. Here are some functions I use that might make a nice addition to your library:

  • select-cols-regex: function that selects columns using a regular expression
  • compare-by: function that returns a comparator that makes it easy to sort ascending or descending per keyword. Unlike the default comparator, it always sorts nil values last.

Cheers!

keesterbrugge avatar Jan 28 '19 14:01 keesterbrugge
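
For readers following along, here is a minimal sketch of what a comparator like compare-by could look like. The signature (alternating keyfn and :asc/:desc) is an assumption inferred from the discussion, and nil-last-compare is a hypothetical helper; this is not the code from the pull request.

```clojure
(defn nil-last-compare
  "Like `compare`, but orders nil after every non-nil value.
  Hypothetical helper, not part of huri."
  [x y]
  (cond
    (and (nil? x) (nil? y)) 0
    (nil? x) 1
    (nil? y) -1
    :else (compare x y)))

(defn compare-by
  "Assumed signature: alternating keyfn and :asc/:desc direction.
  Returns a comparator that checks each keyfn in turn and puts nil
  values last regardless of direction."
  [& keyfn-direction-pairs]
  (let [specs (partition 2 keyfn-direction-pairs)]
    (fn [row-a row-b]
      (or (first
           (for [[keyfn direction] specs
                 :let [x (keyfn row-a)
                       y (keyfn row-b)
                       c (cond
                           (or (nil? x) (nil? y)) (nil-last-compare x y)
                           (= direction :desc) (compare y x)
                           :else (compare x y))]
                 :when (not (zero? c))]
             c))
          0))))

(sort (compare-by :a :asc :b :desc)
      [{:a 1 :b 2} {:a nil :b 9} {:a 1 :b 5} {:a 0 :b nil}])
;; => ({:a 0, :b nil} {:a 1, :b 5} {:a 1, :b 2} {:a nil, :b 9})
```

The first non-zero comparison decides the order, so later keyfns only break ties left by earlier ones.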

I've added the function filter-cols. Say I want to filter columns based on some predicate, such as a regular expression, membership in some collection of keywords, or a composition of such predicates. A function like filter-cols would be nice to have here, as it simplifies the code as follows:
From

(->> df
     ;; change collection of columns in some way, e.g. using (derive-cols ...) 
     (#(select-cols (filter pred' (cols %)) % )))

to

(->> df
     ;; change collection of columns in some way, e.g. using (derive-cols ...) 
     (filter-cols pred'))

I've reimplemented select-cols-regex using this new function

keesterbrugge avatar Jan 28 '19 16:01 keesterbrugge
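
A rough sketch of how filter-cols could be built on top of column selection. The cols and select-cols definitions below are simplified stand-ins written for this example (assuming a data frame is a sequence of maps); they are not huri's actual implementations.

```clojure
(defn cols
  "Stand-in: column names taken from the first row."
  [df]
  (keys (first df)))

(defn select-cols
  "Stand-in: keep only the given columns in every row."
  [ks df]
  (map #(select-keys % ks) df))

(defn filter-cols
  "Keep the columns whose name satisfies pred."
  [pred df]
  (select-cols (filter pred (cols df)) df))

(def df [{:x_1 1 :x_2 2 :y 3} {:x_1 4 :x_2 5 :y 6}])

;; Select by regular expression on the column name.
(filter-cols #(re-find #"^x_" (name %)) df)
;; => ({:x_1 1, :x_2 2} {:x_1 4, :x_2 5})

;; Compose predicates: member of a set, or name ending in _1.
(filter-cols (some-fn #{:y} #(re-find #"_1$" (name %))) df)
;; => ({:x_1 1, :y 3} {:x_1 4, :y 6})
```

Because the predicate is an ordinary function, the regular-expression case needs no dedicated helper.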

Thanks for this. I really like it. A couple of things:

  • I'm not 100% sold on select-cols-regex. It feels very specific (I'm guessing this is for messy data where you have col_name_1...n) and adds minimal convenience over the more generic filter-cols.
  • I think select-cols-by better reflects semantics, as the fn operates on col names rather than values like filter family does.
  • I love the utility of compare-by, but I'm not entirely sold on the signature (though it might be correct). Two things feel like warts: the mandatory :asc/:desc for each comparator, and the fact that we need these "magic" tokens. Have you considered using a combinator that flips 1 <-> -1 instead? Then you'd write something like
(sort (compare-by :a (desc :b)) ...)
  • Using partition will probably yield cleaner code than loop.

sbelak avatar Jan 28 '19 18:01 sbelak
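
One possible reading of the combinator suggestion, sketched under assumptions: the Descending record and term->comparator helper are hypothetical names, and the nil-last behaviour is carried over from the original proposal. Plain keyfns sort ascending; wrapping one in desc flips the sign of its comparison.

```clojure
(defrecord Descending [keyfn])

(defn desc
  "Marks a keyfn as descending (hypothetical combinator)."
  [keyfn]
  (->Descending keyfn))

(defn- term->comparator
  "Turns a plain keyfn (ascending) or a (desc keyfn) term into a
  two-argument comparator that always puts nil values last."
  [term]
  (let [descending? (instance? Descending term)
        keyfn       (if descending? (:keyfn term) term)
        sign        (if descending? -1 1)]
    (fn [row-a row-b]
      (let [x (keyfn row-a)
            y (keyfn row-b)]
        (cond
          (and (nil? x) (nil? y)) 0
          (nil? x) 1
          (nil? y) -1
          :else (* sign (compare x y)))))))

(defn compare-by
  "Chains the terms; the first non-zero comparison wins."
  [& terms]
  (let [comparators (map term->comparator terms)]
    (fn [row-a row-b]
      (or (some (fn [cmp]
                  (let [c (cmp row-a row-b)]
                    (when-not (zero? c) c)))
                comparators)
          0))))

(sort (compare-by :a (desc :b))
      [{:a 1 :b 2} {:a 1 :b nil} {:a 0 :b 7} {:a 1 :b 5}])
;; => ({:a 0, :b 7} {:a 1, :b 5} {:a 1, :b 2} {:a 1, :b nil})
```

With this shape, ascending needs no token at all, and the sign flip is applied only to non-nil comparisons so that nil stays last in both directions.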

Thanks for your feedback :)

  • I'll remove select-cols-regex.
  • I'll think about how to make compare-by a bit cleaner.

Question: any reason you defined the function select-cols instead of using clojure.set/project? Is it because you don't want the return type to be a set?

keesterbrugge avatar Jan 31 '19 11:01 keesterbrugge

3 reasons. In order of importance:

  • sets break the ordering of the data
  • project can only be used with keywords, while select-cols works with any keyfn
  • while the set functions currently work on non-sets that's not a guarantee

sbelak avatar Jan 31 '19 11:01 sbelak
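
The ordering point can be seen in a small example. This only uses clojure.set/project and select-keys from Clojure itself; the rows data is made up for illustration.

```clojure
(require '[clojure.set :as set])

(def rows [{:id 3 :grp :a} {:id 1 :grp :b} {:id 2 :grp :a}])

;; project returns a set, so row order is not guaranteed and
;; duplicate rows collapse into one.
(def projected (set/project rows [:grp]))

(set? projected)   ;; => true
(count projected)  ;; => 2

;; Mapping select-keys over the rows keeps both the order and the row count.
(map #(select-keys % [:grp]) rows)
;; => ({:grp :a} {:grp :b} {:grp :a})
```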

That makes sense.

I've added the function derive-cols* to convey how I'd like the derive-cols function to behave. I don't propose to include it as is.

The benefit of derive-cols* over the current derive-cols is that it takes the ordering of new-cols into account, so a newly constructed column can serve as the input to the next new column. As a result, you can write

(->> [{:a 1 :b 2}{:a 3 :b 10}] 
     (derive-cols* (ordered-map :c [inc :b] 
                                :d [inc :c]))) 
;; => ({:a 1, :b 2, :c 3, :d 4} {:a 3, :b 10, :c 11, :d 12})

or

(->> [{:a 1 :b 2}{:a 3 :b 10}] 
     (derive-cols* [:c [inc :b] 
                    :d [inc :c]])) 

instead of

(->> [{:a 1 :b 2}{:a 3 :b 10}] 
     (derive-cols {:c [inc :b]})
     (derive-cols {:d [inc :c]})) 

which becomes a bother when you have a long chain of new column derivations that have dependencies on each other.

@sbelak What do you think?

I don't know much about clojure.spec yet. I'll make an attempt to implement derive-cols*, select-cols-by and compare-by in a more coherent fashion with respect to the rest of the lib.

keesterbrugge avatar Feb 05 '19 16:02 keesterbrugge
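
A minimal sketch of the order-aware behaviour described above, assuming the flat-vector form of new-cols and a data frame that is a sequence of maps. It is an illustration only, not the derive-cols* from the pull request.

```clojure
(defn derive-cols*
  "Sketch: new-cols is a flat vector alternating a column name with
  [f & input-keyfns]. Specs are applied in order, so a later column
  can read one derived earlier in the same call."
  [new-cols df]
  (map (fn [row]
         (reduce (fn [row [col [f & keyfns]]]
                   ;; keyfns are applied to the row as built so far
                   (assoc row col (apply f (map #(% row) keyfns))))
                 row
                 (partition 2 new-cols)))
       df))

(->> [{:a 1 :b 2} {:a 3 :b 10}]
     (derive-cols* [:c [inc :b]
                    :d [+ :a :c]]))
;; => ({:a 1, :b 2, :c 3, :d 4} {:a 3, :b 10, :c 11, :d 14})
```

Reducing over the specs per row is what makes the ordering matter; a plain map literal for new-cols would not guarantee that order, which is why the examples in the thread use a vector or an ordered map.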