[Bug]: Datahike db keeps growing with no history enabled.
What version of Datahike are you using?
0.4.1480, with konserve "0.6.0-alpha3"
What version of Java are you using?
17
What operating system are you using?
libre-linux, guix
What database EDN configuration are you using?
{:store {:backend :file :path "data/datahike-db"} :keep-history? false}
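For context, this is roughly how such a store is created and connected with that config (a minimal sketch using the standard datahike.api entry points; the path is the one from the config above):

(require '[datahike.api :as d])

(def cfg {:store {:backend :file :path "data/datahike-db"}
          :keep-history? false})

;; create the store on first use, then connect to it
(when-not (d/database-exists? cfg)
  (d/create-database cfg))

(def conn (d/connect cfg))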
Describe the bug
I am importing data from various APIs. For each API request, I save the entire received payload to an EDN file. Then I update the konserve store, indexed by [:invoice id], [:po id], and [:product id]; the konserve store essentially holds all the information I get from the APIs. Finally, I convert each received entity and store the information I care about in datahike.
When I import the same data multiple times into datahike, the datahike db keeps growing. My EDN files and the konserve store stay the same size, though.
This is the entity count table after the initial import:
|-------------------+-------|
| Type              | Count |
|-------------------+-------|
| :distributor      |    54 |
| :product          |   212 |
| :woo-order        |    77 |
| :invoice          |  3704 |
| :lineitem         | 12436 |
| :lineitem-invoice | 11937 |
| :lineitem-po      |   499 |
| :tracking         |    39 |
| :po               |   164 |
|-------------------+-------|
This is the size as determined by ncdu
62.9 MiB [###################] /datahike-db
12.1 MiB [### ] /import
2.7 MiB [ ] /konserve-db
When I run the SAME import again, the entity count table is the same, but the size of the datahike db grows:
169.9 MiB [###################] /datahike-db
12.1 MiB [# ] /import
2.7 MiB [ ] /konserve-db
The "mistake" I make is that I write to datahike new data that is essentially the same data as it was before. But since I have history disabled, the outcome what is stored in datahike should be the same.
I guess what happens is that the hitchhiker tree gets "fragmented".
My update conversion routine essentially on each import for each invoice deletes all old lineitems, and then re-creates them, so in the database there will be a churn of entitites.
What is the expected behaviour?
I would not expect the database size to grow that much. Is there a function that can defragment the datahike db?
How can the behaviour be reproduced?
(ns crb.db.update.simple
  (:require
   [taoensso.timbre :as timbre :refer [debug info error]]
   [crb.db.datahike :refer [transact] :as dhdb]))

(defn existing-db-id
  "Return the :db/id of the entity whose attribute `db-attr` has value `id-val`, or nil."
  [db-attr id-val]
  (debug "search id col:" db-attr " id: " id-val)
  (let [query '[:find [?id]
                :in $ ?db-attr ?id-val
                :where
                [?id ?db-attr ?id-val]]
        d (dhdb/q query db-attr id-val)]
    (first d)))
(defn update-datahike
  "Upsert one converted entity into datahike. Returns {:update id} when an
   existing entity was found via its id attribute, otherwise {:insert new-id}."
  [{:keys [id-data-fn id-db-key convert-fn] :as e} data]
  (let [id        (id-data-fn data)
        _         (debug "id data: " id)
        id-db     (existing-db-id id-db-key id)
        _         (debug "id db: " id-db)
        t-data    (convert-fn data)
        _         (debug "tx-data: " (pr-str t-data))
        t         (if id-db
                    (assoc t-data :db/id id-db)    ; update the existing entity
                    (assoc t-data id-db-key id))   ; create a new entity carrying its id attribute
        _         (debug "tx: " (pr-str t))
        t-result  (dhdb/transact [t])
        id-db-new (-> (:tx-data t-result) first first)] ; entity id of the first reported datom
    (debug "transact result: " t-result)
    (debug "tx data: " (:tx-data t-result))
    (if id-db
      {:update id-db}
      {:insert id-db-new})))
(ns crb.db.update
  (:require
   [taoensso.timbre :as timbre :refer [debug info warn error]]
   [crb.db.konserve :as k]
   [crb.db.update.simple :as simple]
   ; data
   [crb.db.update.distributor :as distributor]
   [crb.db.update.xero-product :as xero-product]
   [crb.db.update.xero-invoice :as xero-invoice]
   [crb.db.update.xero-po :as xero-po]
   [crb.db.update.woo :as woo]
   [crb.db.update.aftership-tracking :as aftership]))
(def types
  {; reference data
   :distributor        {:id-data-fn distributor/distributor-id
                        :id-db-key  :distributor/name
                        :convert-fn distributor/convert-distributor}
   :xero-product       {:id-data-fn :ItemID
                        :id-db-key  :product/id
                        :convert-fn xero-product/convert-product}
   :xero-contact       {:id-data-fn :ContactID}
   ; :xero-group identity ; todo change this!
   ; woo
   :woo-order          {:id-data-fn :number
                        :id-db-key  :woo-order/id
                        :convert-fn woo/convert-woo-order}
   :woo-product        {:id-data-fn :id}
   ; xero transactions
   :xero-invoice       {:id-data-fn :InvoiceID
                        :id-db-key  :invoice/id
                        :convert-fn xero-invoice/convert-invoice}
   :xero-po            {:id-data-fn :PurchaseOrderID
                        :id-db-key  :po/id
                        :convert-fn xero-po/convert-po}
   :aftership-tracking {:id-data-fn :id
                        :id-db-key  :tracking/id
                        :convert-fn aftership/convert-tracking}})
(defn update-konserve [t {:keys [id-data-fn]} item]
  (let [id  (id-data-fn item)
        key [t id]]
    (k/save key item)))

(defn update-datahike [t {:keys [id-data-fn id-db-key convert-fn] :as e} item]
  (when (and id-data-fn id-db-key convert-fn)
    (debug "processing update for: " t)
    (-> (simple/update-datahike e item)
        (assoc :type t))))
(defn update-type [t e item]
  ; 1. konserve update
  (update-konserve t e item)
  (let [; 2. datahike update - general
        r (update-datahike t e item)]
    ; 3. datahike update - type specific
    (case t
      :xero-invoice (xero-invoice/update-lineitems item)
      :xero-po      (xero-po/update-lineitems item)
      :woo-product  (woo/enhance-product-retail-price item)
      true)
    ; return result
    (debug "update datahike result: " r)
    r))
(defn process-update [t item]
  (if-let [e (t types)]
    (update-type t e item)
    {:type  :unknown
     :error (str "not found: " t)}))
(ns crb.db.update.xero-lineitem
  (:require
   [taoensso.timbre :as timbre :refer [debug info warn error]]
   [crb.db.datahike :refer [q] :as dhdb]
   [crb.domain.invoice.date :refer [xero-str-date->instant]]))

(defn if-exists [k v]
  (if v {k v} {}))
; lineitem for PO
; {:LineItemID "1253a9c1-75e7-4d82-afab-deed97685b41"
;  :UnitAmount 16.05 :LineAmount 1540.8
;  :TaxType "NONE" :TaxAmount 0.0 :Tracking []
;  :AccountCode "I-CHEM", :Description "Dry Carpet Cleaning Compound 15lbs.",
;  :Quantity 96.0, :ItemCode "AB15"}
; lineitem for invoice
; {:LineItemID "878c1c09-1061-4dad-8dc4-a366a903e2ea"
;  :UnitAmount 500.0 :LineAmount 500.0
;  :TaxAmount 0.0 :TaxType "NONE" :Tracking []
;  :AccountCode "I-MACHINES", :Description "TM5 20 Machine"
;  :Quantity 1.0, :Item {:ItemID "1de9fcd4-268e-4b2b-96cc-a4aa44cfe1a8", :Name "TM5 20” Dry Carpet and Hard Floor Cleaning Machine"
;                        :Code "TM5-C" :ItemCode "TM5-C"}}
(defn convert-lineitem [link-type link-id {:keys [LineItemID UnitAmount Quantity Item ItemCode] :as line-item}]
  (debug "converting lineitem: " line-item)
  (let [core (merge {:lineitem/id LineItemID}
                    (case link-type
                      :invoice {:lineitem/invoice [:invoice/id link-id]}
                      :po      {:lineitem/po [:po/id link-id]}
                      {})
                    (if-exists :lineitem/price UnitAmount)
                    (if-exists :lineitem/qty Quantity))
        {:keys [ItemID]} Item
        item (if Item
               (assoc core :lineitem/product [:product/id ItemID]) ; invoices have ItemID
               (if ItemCode
                 (do (debug "linking by sku: " ItemCode)
                     (assoc core :lineitem/product [:product/sku ItemCode])) ; po has only sku
                 (do (debug "no product for: " line-item)
                     core)))]
    item))
(def xero-lineitems-for-invoice
  '[:find [?li ...]
    :in $ ?invoice-id
    :where
    [?id :invoice/id ?invoice-id]
    [?li :lineitem/invoice ?id]])

(def xero-lineitems-for-po
  '[:find [?li ...]
    :in $ ?po-id
    :where
    [?id :po/id ?po-id]
    [?li :lineitem/po ?id]])
(defn tx-remove-lineitem-id [lineitem-id]
  ;[:db/retractEntity [:lineitem/invoice lineitem-id]]
  [:db/retractEntity lineitem-id])

(defn tx-remove-lineitems-for [type link-id]
  (let [query (case type
                :invoice xero-lineitems-for-invoice
                :po      xero-lineitems-for-po)
        old-lineitem-ids (q query link-id)]
    (debug "old lineitem-ids: " old-lineitem-ids)
    (map tx-remove-lineitem-id old-lineitem-ids)))
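To make the growth easy to observe, a repro driver could look roughly like this (a sketch only: raw-invoice stands for one of the Xero invoice maps from the import files, and the path is the store path from the config above):

(ns crb.repro
  (:require
   [clojure.java.io :as io]
   [crb.db.update :as update]))

;; total size in bytes of all files under path
(defn dir-size [path]
  (->> (file-seq (io/file path))
       (filter #(.isFile %))
       (map #(.length %))
       (reduce + 0)))

;; re-import the same invoice n times and report the store size after each
;; round; the entity counts stay constant, but the folder keeps growing
(defn repro [raw-invoice n]
  (doall
   (for [i (range n)]
     (do (update/process-update :xero-invoice raw-invoice)
         [i (dir-size "data/datahike-db")]))))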
I added a transaction count to my statistics, just to see whether the growth is caused by newly added transactions. But there are no transactions in the db.
|-------------------+-------|
| Type              | Count |
|-------------------+-------|
| :distributor      |    54 |
| :product          |   212 |
| :woo-order        |    77 |
| :invoice          |  3704 |
| :lineitem         | 12436 |
| :lineitem-invoice | 11937 |
| :lineitem-po      |   499 |
| :tracking         |    39 |
| :po               |   164 |
| :db-transaction   |     0 |
|-------------------+-------|
I have made a dirty hack which works: I use [datahike.migrate :refer [export-db import-db]]. I do export-db, then delete the folder, and then import-db. This shrinks my db from 400 MB to 30 MB. It must be some kind of fragmentation of the hitchhiker tree that happens when I update lots of fields.
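For reference, the hack roughly looks like this (a sketch; the exact argument types of export-db/import-db have changed between Datahike releases, so check the datahike.migrate docstrings for your version):

(require '[datahike.api :as d]
         '[datahike.migrate :refer [export-db import-db]])

(def cfg {:store {:backend :file :path "data/datahike-db"}
          :keep-history? false})

;; 1. dump all datoms of the current db value to a flat file
(export-db @(d/connect cfg) "/tmp/datahike-export.edn")

;; 2. drop the fragmented store and recreate it empty
(d/delete-database cfg)
(d/create-database cfg)

;; 3. read the datoms back into the fresh store
(import-db (d/connect cfg) "/tmp/datahike-export.edn")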
I think this might be because of some upsert operations. We'll take a look at that.
With the newly committed versions, should this be better? Also, I would like to know: if I generate transactions that contain unchanged data, will this produce a db change? In other words, does the client have to be smart, or is the db smart enough?
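One way to answer the "does the client have to be smart" part empirically is to inspect the :tx-data of the transact report: apart from the transaction's own :db/txInstant datom, a transaction that changed nothing should report no datoms. A sketch (conn and the :invoice/* attributes are illustrative; this assumes :invoice/id is a :db.unique/identity attribute so the map form upserts onto the existing entity):

(require '[datahike.api :as d])

(defn no-op-tx?
  "True if the transact report contains no datoms other than the
   transaction entity's own :db/txInstant datom."
  [tx-report]
  (->> (:tx-data tx-report)
       (remove #(= :db/txInstant (:a %)))
       empty?))

;; conn is an open Datahike connection; re-transact values that are
;; already in the db and check whether any datoms were reported
(def report (d/transact conn [{:invoice/id "INV-001" :invoice/total 500.0}]))
(no-op-tx? report)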
There is now an experimental GC which will collect old versions in case you don't want them. @awb99 Can you check and report whether this solves your issue?
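A sketch of invoking the GC (assuming the entry point that later landed in datahike.api as gc-storage; at the time of this issue it lived in an experimental namespace, so verify the exact function name and return value against your version's docs):

(require '[datahike.api :as d])

;; conn is an open connection to the store; collect everything that is no
;; longer reachable from the current db roots. With :keep-history? false
;; there are no historical versions worth keeping around.
(d/gc-storage conn)

;; optionally only remove data older than a cut-off date
(d/gc-storage conn (java.util.Date.))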
I will close this for now, as GC should solve the issue. If not, please reopen.