
separate-columns with default target naming

Open genmeblog opened this issue 1 year ago • 1 comments

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/seperate.20with.20custom.20fn

genmeblog avatar Oct 09 '22 22:10 genmeblog

This will be a (minor) breaking change. By default, the source column will be replaced by the new ones in every case.

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |    2 |    3 |    9 |   10 |   11 |   22 |   33 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y reverse))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |   33 |   22 |   11 |   10 |    9 |    3 |    2 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y (fn [input]
                             (zipmap "somenames" input))))
;; => _unnamed [1 7]:
;;    | :x |  a | s |  e |  m |  n | o |
;;    |---:|---:|--:|---:|---:|---:|--:|
;;    |  1 | 22 | 2 | 10 | 33 | 11 | 3 |
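In the last example the custom function returns a map, and its keys become the new column names. A small sketch in plain Clojure (no tablecloth involved) of why seven input values yield only six new columns; the var name `separated` is only for illustration:

```clojure
;; zipmap pairs keys with values and stops at the shorter sequence,
;; so only the first seven characters of "somenames" are used: s o m e n a m
(def separated (zipmap "somenames" [2 3 9 10 11 22 33]))

;; \m occurs twice among those seven keys; the later value (33) replaces 9
(count separated)   ; six distinct keys, hence six new columns
(get separated \m)  ; the value that survives for the duplicated key
```

So a function that returns duplicate keys silently drops values, which is worth keeping in mind when generating names.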

genmeblog avatar Oct 10 '22 10:10 genmeblog

I am now wondering whether this use case should be handled by "tc/separate-column" or whether it requires a completely new method, for performance reasons. The seq in your example, [2 3 9 10 11 22 33], could just as well be a double array, like this:

(def ds
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

Separating this (especially when it is large) could be done in an optimized way like this:

(->
 (tech.v3.datatype/concat-buffers (:y ds))
 (tech.v3.tensor/reshape [(tc/row-count ds)
                          (-> ds :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

(plus replacing the column :y in the ds with the new ds)

I suppose this is significantly faster than the generic "separate" implementation you have in tc/separate-column. It works as well for the persistent vector case above.
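The reshape steps plus the column replacement could be wrapped roughly as follows. This is only a sketch under assumptions: `separate-array-column` is a hypothetical name, not part of tablecloth, the `:y-0`, `:y-1`, ... naming merely mimics the default shown earlier, and every row is assumed to hold a sequence of the same length.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt]
         '[tech.v3.dataset.tensor :as ds-tensor])

(defn separate-array-column
  "Hypothetical helper: replaces column `colname` (one equal-length array per row)
  with one column per array element, named <colname>-0, <colname>-1, ..."
  [ds colname]
  (let [col       (get ds colname)
        n-cols    (count (first col))
        ;; same steps as in the snippet above: concat, reshape, convert
        separated (-> (dtype/concat-buffers col)
                      (dtt/reshape [(tc/row-count ds) n-cols])
                      (ds-tensor/tensor->dataset))
        new-names (map #(keyword (str (name colname) "-" %)) (range n-cols))
        ;; rename whatever names tensor->dataset produced to the target names
        renamed   (tc/rename-columns separated
                                     (zipmap (tc/column-names separated) new-names))]
    (-> ds
        (tc/drop-columns colname)
        (tc/append renamed))))

;; usage
(def example-ds
  (tc/dataset {:x [1 2] :y [(double-array [2 3 9 10 11 22 33])
                            (double-array [2 3 9 10 11 22 33])]}))
(def example-result (separate-array-column example-ds :y))
```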

Test cases could be these:

(def ds-1
  (-> (tc/dataset {:x [1 2] :y [[2 3 9 10 11 22 33]
                                [2 3 9 10 11 22 33]]})))

(def ds-2
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

(def ds-3
  (-> (tc/dataset {:x [1] :y [(list 2 3 9 10 11 22 33)]})))


(->
 (tech.v3.datatype/concat-buffers (:y ds-1))
 (tech.v3.tensor/reshape [(tc/row-count ds-1)
                          (-> ds-1 :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

behrica avatar Oct 22 '22 21:10 behrica

In some cases we even want to get the tensor back and not the data frame, so omit the last tensor->dataset call.
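A minimal sketch of that variant, assuming the same constraints as before (equal-length arrays in every row); `array-column->tensor` is a hypothetical name used only for illustration:

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt])

(defn array-column->tensor
  "Hypothetical helper: returns the arrays stored in column `colname`
  as a 2-d tensor of shape [row-count array-length]."
  [ds colname]
  (let [col (get ds colname)]
    (-> (dtype/concat-buffers col)
        (dtt/reshape [(tc/row-count ds) (count (first col))]))))

;; usage
(def example-tensor
  (array-column->tensor
   (tc/dataset {:x [1 2] :y [(double-array [2 3 9 10 11 22 33])
                             (double-array [2 3 9 10 11 22 33])]})
   :y))
```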

I think it is a useful addition to tablecloth; often we go from a dataset to a conceptual 2-d matrix (but with the matrix rows stored inside a single dataset column).

behrica avatar Oct 22 '22 21:10 behrica

Not sure about the reverse, i.e. starting from a dataset with several (numeric) columns and squeezing them into a single column of native arrays.

behrica avatar Oct 22 '22 22:10 behrica

For the reverse, something like this works; not sure if it is optimal:


(def ds
  (tc/dataset {:x-0 [1 2 3]
               :x-1 [4 5 6]}))
;; ds => _unnamed [3 2]:
;;    | :x-0 | :x-1 |
;;    |-----:|-----:|
;;    |    1 |    4 |
;;    |    2 |    5 |
;;    |    3 |    6 |


(def rows
  (->
   (tech.v3.datatype/concat-buffers (tc/columns ds))
   (tech.v3.tensor/reshape [(tc/column-count ds)
                            (tc/row-count ds)])
   (tech.v3.tensor/transpose [1 0])
   (tech.v3.tensor/rows)))

(tc/dataset {:x (map tech.v3.datatype/->double-array rows)})
;; => _unnamed [3 1]:
;;    |          :x |
;;    |-------------|
;;    | [D@1600011f |
;;    |  [D@fc74513 |
;;    | [D@20c51970 |

behrica avatar Oct 22 '22 22:10 behrica

I would think that a pair of functions to go from one representation to the other would be useful.
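One half of such a pair could be sketched by wrapping the reverse snippet above in a function. The name `pack-columns` and its arguments are hypothetical, and the sketch assumes all columns are numeric and that only the packed column is wanted in the result.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt])

(defn pack-columns
  "Hypothetical helper: packs all (numeric) columns of `ds` into a single
  column `target-name` holding one double array per row."
  [ds target-name]
  (let [rows (-> (dtype/concat-buffers (tc/columns ds))
                 ;; concatenated columns form a [columns x rows] matrix
                 (dtt/reshape [(tc/column-count ds) (tc/row-count ds)])
                 ;; transpose so that each tensor row is one dataset row
                 (dtt/transpose [1 0])
                 (dtt/rows))]
    (tc/dataset {target-name (map dtype/->double-array rows)})))

;; usage
(def packed
  (pack-columns (tc/dataset {:x-0 [1 2 3]
                             :x-1 [4 5 6]})
                :x))
```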

behrica avatar Oct 22 '22 22:10 behrica

Looks like it's a very specific case, a kind of matrix transpose. I'm not sure it belongs in TC.

The last case (reverse) can be done with join-columns and {:result-type double-array}

BTW, does tensor work on non-numerical data?

genmeblog avatar Oct 24 '22 07:10 genmeblog

My original solution landed in 6.103

genmeblog avatar Oct 24 '22 08:10 genmeblog

Numeric only. I think there should be 2 methods for this in TC that operate on a Dataset. It's a specific form of separate.

behrica avatar Oct 24 '22 12:10 behrica

Numeric only. I think there should be 2 methods for this in TC that operate on a Dataset. It's a specific form of separate, and it requires arrays of the same type and length in each row. I can do a PR, as I have a use case.
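The constraint mentioned here (same type and length in each row) could be checked up front. A minimal sketch in plain Clojure; `uniform-rows?` is a hypothetical name, and comparing `class` is just one assumed way of interpreting "same type":

```clojure
(defn uniform-rows?
  "Hypothetical check: true when every row value has the same class
  (e.g. all double arrays) and the same number of elements."
  [rows]
  (or (empty? rows)
      (and (apply = (map class rows))
           (apply = (map count rows)))))

;; usage
(uniform-rows? [(double-array [1 2]) (double-array [3 4])]) ; same class and length
(uniform-rows? [(double-array [1 2]) (double-array [3])])   ; lengths differ
(uniform-rows? [(double-array [1 2]) (long-array [3 4])])   ; element types differ
```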

behrica avatar Oct 24 '22 13:10 behrica

But indeed it goes into numeric stuff and going from a dataset to a matrix.

behrica avatar Oct 24 '22 13:10 behrica

I will try it out forward and backward. I have the impression, without proof, that my code above could be far more performant, though with some constraints.

I will measure it on a larger case.

behrica avatar Oct 24 '22 19:10 behrica

As I thought. On a 1000 * 1000 matrix-like dataset of doubles:

(def ds (api/dataset {:x (map 
                          (fn [_] (double-array (range 1000)))
                          (range 1000))}))

we get a factor of 50 to 100 difference in execution time:

(defn use-separate []
 (api/separate-column ds :x))

(defn use-reshape []
 (->
  (tech.v3.datatype/concat-buffers (:x ds))
  (tech.v3.tensor/reshape [(api/row-count ds)
                           (-> ds :x first count)])
  (tech.v3.dataset.tensor/tensor->dataset)))


(time (def _ (use-separate)))
;; "Elapsed time: 3371.491881 msecs"
(time (def _ (use-reshape)))
;; "Elapsed time: 76.420533 msecs"

for producing the same dataset.
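A single `time` call also includes JIT warm-up, so the ratio can vary from run to run. A hedged sketch of averaging several runs in plain Clojure; `mean-elapsed-ms` is a hypothetical helper, and a dedicated benchmarking library would give more reliable numbers:

```clojure
(defn mean-elapsed-ms
  "Hypothetical helper: calls the zero-argument function f once to warm up,
  then n more times, and returns the mean wall-clock time per call in ms."
  [f n]
  (f)
  (let [start (System/nanoTime)]
    (dotimes [_ n] (f))
    (/ (- (System/nanoTime) start) 1e6 n)))

;; usage with the two functions defined above:
;; (mean-elapsed-ms use-separate 5)
;; (mean-elapsed-ms use-reshape 5)
```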

behrica avatar Oct 24 '22 19:10 behrica

The reverse is less of a difference, still a factor of 5:

(def ds-with-cols (use-reshape))

(time
 (def _  (api/join-columns ds-with-cols :x (api/column-names ds-with-cols) {:result-type double-array})))
;; "Elapsed time: 333.478279 msecs"

(time
 (let [rows
       (->
        (tech.v3.datatype/concat-buffers (api/columns ds-with-cols))
        (tech.v3.tensor/reshape [(api/column-count ds-with-cols)
                                 (api/row-count ds-with-cols)])
        (tech.v3.tensor/transpose [1 0])
        (tech.v3.tensor/rows))]
   (api/dataset {:x (map tech.v3.datatype/->double-array rows)})))
;; "Elapsed time: 66.384538 msecs"

behrica avatar Oct 24 '22 20:10 behrica

But I was wrong above: the code works with non-numeric data as well.

behrica avatar Oct 24 '22 20:10 behrica

Yes, join-columns and separate-column are slow. I know that. These two functions are more general than just packing/unpacking a sequence to/from column(s). join-columns and separate-column are more or less the same as tidyr's extract, separate and unite functions.

Your example is just one special case, which can surely be optimized. If you have an idea for a PR, it's always welcome.

genmeblog avatar Oct 24 '22 21:10 genmeblog