erigon-lib icon indicating copy to clipboard operation
erigon-lib copied to clipboard

ETL load does not support 0 value keys

Open elee1766 opened this issue 3 years ago • 3 comments

first time using etl just now- I was scratching my head over why Load was not working, it was because i was trying to load empty values, possible with direct transaction, but seems currently seems in ETL framework it is used as the delete identifier.

https://github.com/ledgerwatch/erigon-lib/blob/main/etl/collector.go#L239

One possible solution - add TransformArg "AllowEmptyValues" to skip this check, and allow therefore "zero values"? this is something supported by MDBX. Happy to make a PR if it seems like a good idea.

example use case, filling a database with a large amount of random empty keys for performance benchmark:

func (d *Database) AddKeys(ctx context.Context, bucket string, bytesize int, keysize int) error {
	src := frand.New()
	cl := etl.NewCollector("mdbxbench", os.TempDir(), etl.NewOldestEntryBuffer(etl.BufferOptimalSize*5))
	for i := 0; i < (bytesize / keysize); i++ {
		o := make([]byte, keysize)
		src.Read(o)
		err := cl.Collect(o[:], []byte{})
		if err != nil {
			return err
		}
	}
	defer cl.Close()
	err := d.DB().Update(ctx, func(tx kv.RwTx) error {
		return cl.Load(tx, bucket, func(k, v []byte, _ etl.CurrentTableReader, next etl.LoadNextFunc) error {
			return next(k, k, []byte{}) /// it would be good if this actually added the entry, instead of deleting. For now i am just doing []byte{0}
		}, etl.TransformArgs{})
	})
	if err != nil {
		return err
	}
	return nil
}

elee1766 avatar Dec 05 '22 09:12 elee1766

It's maybe-big task. Next things needs to be done:

  • check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store empty value in: value, key, value of DupSort table.
  • check that kv_mdbx.go does support empty values
  • check that kv_remote and remotedbserver do support empty values
  • ensure that empty value it's not nil. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you put empty value then return empty value not nil. likely need write tests for this special case - because it's easy to break.
  • etl has 3 collectors all of them need support empty value
  • etl now implemented next way: nil-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables.
  • take a look into rawdb package - maybe there are checks like if len(k) == 0: nil and empty value will pass this check.

AskAlexSharov avatar Dec 05 '22 11:12 AskAlexSharov

Hm, I will take one step at a time. and update this post as I investigate

* check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store `empty value` in: value, key, value of DupSort table.

it seems https://github.com/torquem-ch/mdbx-go/blob/master/mdbx/txn.go#L548 does not support empty values. replaces 0 length with a []byte{0}, then passes to Put. it seems in MDBX - reserve.iov_len = (data ? data->iov_len : 0) + sizeof(mdbx_attr_t); is used, implying that if NULL is passed as data, that would be how to set an empty value. Assuming key is the same

* check that kv_mdbx.go does support empty values

seems it passes through https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1109, https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1374 all the way down, so it does support empty values

as for empty keys, seems it is properly handling it, at least here: https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L535. tests will need to be written

TODO:

* check that kv_remote and remotedbserver do support empty values

* ensure that `empty value` it's not `nil`. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you put `empty value` then return `empty value` not `nil`. likely need write tests for this special case - because it's easy to break.

* etl has 3 collectors all of them need support `empty value`

* etl now implemented next way: `nil`-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables.

* take a look into `rawdb` package - maybe there are checks like `if len(k) == 0`: `nil` and `empty value` will pass this check.

elee1766 avatar Dec 06 '22 23:12 elee1766

FYI: you can ignore b.AutoDupSortKeysConversion == true case

AskAlexSharov avatar Dec 07 '22 01:12 AskAlexSharov