BUG insert dataset depends on INTERNAL ds order.
I found a weird bug when inserting a dataset via (duckdb/insert-dataset! (:conn session) ds)
Sometimes the insert fails, even though all columns are the same and all data-types are the same
as defined via (duckdb/create-table! (:conn session) ds)
From printing the initial dataset that was used to create the table, I have deducted the order in which tml-ds has arranged the columns internally. I replicate this via (order-columns-strange ds). When I want call insert-dataset! via a differently arranged dataset, for example: (order-columns ds), then the insert throws an exception.
This is a weird behavior which I have not have expected.
This is the code I used to get the inserts to work / not-work:
(defn order-columns [ds]
(tc/dataset [(:date ds)
(:asset ds)
(:open ds)
(:high ds)
(:low ds)
(:close ds)
(:volume ds)
(:epoch ds)
(:ticks ds)]))
(defn order-columns-strange [ds]
(tc/dataset [(:open ds)
(:epoch ds)
(:date ds)
(:close ds)
(:volume ds)
(:high ds)
(:low ds)
(:ticks ds)
(:asset ds)
]))
Yes I haven't verified this but the duckdb insert operation should be independent of the dataset column order. Seems odd.
What exactly is the error?
Confirmed, maybe this can lead to a test case:
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tmducken.duckdb :as duckdb])
nil
user> (duckdb/initialize! {:duckdb-home "binaries"})
Feb 17, 2024 12:01:37 PM clojure.tools.logging$eval13663$fn__13666 invoke
INFO: Attempting to load duckdb from "binaries/libduckdb.so"
true
user> (def db (duckdb/open-db))
#'user/db
user> (def conn (duckdb/connect db))
#'user/conn
user> (def ds (ds/select-columns (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}) [:a :b]))
#'user/ds
user> ds
_unnamed [3 2]:
| :a | :b |
|---:|----|
| 1 | x |
| 2 | y |
| 3 | z |
user> (duckdb/create-table! conn ds)
"_unnamed"
user> (duckdb/insert-dataset! conn ds)
3
user> (duckdb/sql->dataset conn "select * from _unnamed")
Feb 17, 2024 12:03:50 PM clojure.tools.logging$eval13663$fn__13666 invoke
INFO: Reference thread starting
:_unnamed [3 2]:
| a | b |
|--:|---|
| 1 | x |
| 2 | y |
| 3 | z |
user> (def db (duckdb/open-db)) ;; FRESH NEW DB
#'user/db
user> (def conn (duckdb/connect db))
#'user/conn
user> (duckdb/create-table! conn ds)
"_unnamed"
user> (def swizzled (ds/select-columns ds [:b :a]))
#'user/swizzled
user> swizzled
_unnamed [3 2]:
| :b | :a |
|----|---:|
| x | 1 |
| y | 2 |
| z | 3 |
user> (duckdb/insert-dataset! conn swizzled)
Execution error at tmducken.duckdb/insert-dataset!$fn$check-error (duckdb.clj:321).
{:address 0x00007FF29101D730 }
The work around for this is to use a prepared statement - this may take a moment to fix as it doesn't appear that duckdb exposes tableinfo from the connection to the C api at this time.
This is a namespace that can be used to reproduce the bug I posted:
(ns notebook.playground.bardb.bug-duckdb
(:require
[tick.core :as t]
[tablecloth.api :as tc]
[tmducken.duckdb :as duckdb]))
(def db (duckdb/open-db "/tmp/duckdb-test"))
(def conn (duckdb/connect db))
(defn make-ds [v]
(let [table-name "test"]
(-> (tc/dataset v)
(tc/set-dataset-name table-name))))
(def empty-ds (make-ds [{:date (t/now)
:open 0.0 :high 0.0 :low 0.0 :close 0.0
:volume 0.0 ; crypto volume is double.
:ticks 0
:asset "000"}]))
empty-ds
(duckdb/create-table! conn empty-ds)
(def data-ds (make-ds [{:date (t/now)
:open 1.0 :high 2.0 :low 3.0 :close 4.0
:volume 0.0 ; crypto volume is double.
:ticks 0
:asset "1"}]))
(duckdb/insert-dataset! conn data-ds)
;; => 1
(def data2-ds (make-ds [{:asset "2"; NOTE asset is the first key
:date (t/now)
:open 1.0 :high 2.0 :low 3.0 :close 4.0
:volume 0.0 ; crypto volume is double.
:ticks 0
}]))
(duckdb/insert-dataset! conn data2-ds)
;; => Execution error at tmducken.duckdb/insert-dataset!$fn$check-error (duckdb.clj:320).
;; {:address 0x00007F53A4467DB0 }