ETL load does not support 0 value keys
first time using etl just now- I was scratching my head over why Load was not working, it was because i was trying to load empty values, possible with direct transaction, but seems currently seems in ETL framework it is used as the delete identifier.
https://github.com/ledgerwatch/erigon-lib/blob/main/etl/collector.go#L239
One possible solution - add TransformArg "AllowEmptyValues" to skip this check, and allow therefore "zero values"? this is something supported by MDBX. Happy to make a PR if it seems like a good idea.
example use case, filling a database with a large amount of random empty keys for performance benchmark:
func (d *Database) AddKeys(ctx context.Context, bucket string, bytesize int, keysize int) error {
src := frand.New()
cl := etl.NewCollector("mdbxbench", os.TempDir(), etl.NewOldestEntryBuffer(etl.BufferOptimalSize*5))
for i := 0; i < (bytesize / keysize); i++ {
o := make([]byte, keysize)
src.Read(o)
err := cl.Collect(o[:], []byte{})
if err != nil {
return err
}
}
defer cl.Close()
err := d.DB().Update(ctx, func(tx kv.RwTx) error {
return cl.Load(tx, bucket, func(k, v []byte, _ etl.CurrentTableReader, next etl.LoadNextFunc) error {
return next(k, k, []byte{}) /// it would be good if this actually added the entry, instead of deleting. For now i am just doing []byte{0}
}, etl.TransformArgs{})
})
if err != nil {
return err
}
return nil
}
It's maybe-big task. Next things needs to be done:
- check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store
empty valuein: value, key, value of DupSort table. - check that kv_mdbx.go does support empty values
- check that kv_remote and remotedbserver do support empty values
- ensure that
empty valueit's notnil. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you putempty valuethen returnempty valuenotnil. likely need write tests for this special case - because it's easy to break. - etl has 3 collectors all of them need support
empty value - etl now implemented next way:
nil-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables. - take a look into
rawdbpackage - maybe there are checks likeif len(k) == 0:nilandempty valuewill pass this check.
Hm, I will take one step at a time. and update this post as I investigate
* check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store `empty value` in: value, key, value of DupSort table.
it seems https://github.com/torquem-ch/mdbx-go/blob/master/mdbx/txn.go#L548 does not support empty values. replaces 0 length with a []byte{0}, then passes to Put. it seems in MDBX - reserve.iov_len = (data ? data->iov_len : 0) + sizeof(mdbx_attr_t); is used, implying that if NULL is passed as data, that would be how to set an empty value. Assuming key is the same
* check that kv_mdbx.go does support empty values
seems it passes through https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1109, https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1374 all the way down, so it does support empty values
as for empty keys, seems it is properly handling it, at least here: https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L535. tests will need to be written
TODO:
* check that kv_remote and remotedbserver do support empty values * ensure that `empty value` it's not `nil`. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you put `empty value` then return `empty value` not `nil`. likely need write tests for this special case - because it's easy to break. * etl has 3 collectors all of them need support `empty value` * etl now implemented next way: `nil`-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables. * take a look into `rawdb` package - maybe there are checks like `if len(k) == 0`: `nil` and `empty value` will pass this check.
FYI: you can ignore b.AutoDupSortKeysConversion == true case