py-lmdb

Seems like it doesn't support Chinese keys between named databases?

Open zyh3826 opened this issue 2 years ago • 2 comments

Affected Operating Systems

  • Linux

Affected py-lmdb Version

1.3.0/1.0.0

py-lmdb Installation Method

sudo pip install lmdb

Machine "free -m" output

              total        used        free      shared  buff/cache   available
Mem:         257421       46163      116679        2694       94578      207699
Swap:         32767       23075        9692

Describe Your Problem

I have several named databases and some key-value pairs with Chinese keys. When I insert them into the named databases, the result is incorrect: every named database ends up containing the same data. Code:

import lmdb
import pickle

# insert
d = {
    '1999': [['19990012', '动画片']],
    '1114': [['11140004', '动画片'], ['11140011', '冒险']],
    '1101': [['11010020', '冒险']]
}
sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(pickle.dumps(main_type))
    with sub_type_env.begin(write=True, db=db) as sub_type2id_txn:
        for item in items:
            tag_id, tag_name = item
            key = tag_name.encode()
            val = tag_id.encode()
            sub_type2id_txn.put(key, val)
        print('{} -> {} -> {}'.format(main_type, len(items), sub_type2id_txn.stat()['entries']))
sub_type_env.close()

# iterate
sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(pickle.dumps(main_type))
    with sub_type_env.begin(write=True, db=db) as sub_type2id_txn:
        for i, j in sub_type2id_txn.cursor():
            print('{}->{}->{}'.format(main_type, i.decode(), j.decode()))
sub_type_env.close()

output:

1999->冒险->11010020
1999->动画片->11140004
1114->冒险->11010020
1114->动画片->11140004
1101->冒险->11010020
1101->动画片->11140004

I tried other encoding methods such as pickle, but got the same results. What should I do? Thanks a lot.

zyh3826, Apr 01 '22 06:04

change data to:

d = {
    1999: [[19990012, '动画片']],
    1114: [[11140004, '动画片'], [11140011, '冒险']],
    1101: [[11010020, '冒险']]
}
This also gives the wrong output:
1999->冒险->11010020
1999->动画片->11140004
1114->冒险->11010020
1114->动画片->11140004
1101->冒险->11010020
1101->动画片->11140004

change data to:

d = {
    1: [[1, '2'], [3, '4'], [5, '6']],
    2: [[3, '4'], [5, '6']],
    3: [[5, '6']]
}
This gives the correct output:
1->2->1
1->4->3
1->6->5
2->4->3
2->6->5
3->6->5

zyh3826, Apr 01 '22 07:04

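It appears that the \x00 bytes in the pickle.dumps() output interact badly with mdb.c's use of strlen to compute the length of the database name: strlen stops at the first \x00, so pickled names that differ only after that byte would be treated as the same name. Using str.encode() instead of pickle.dumps() produces the expected output.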

import lmdb
import pickle

def en(s: str) -> bytes:
    # pickle.dumps() fails: its output contains \x00 bytes
    # ret = pickle.dumps(s)
    # str.encode() works: the result has no \x00 bytes
    ret = s.encode()
    parts = ret.split(b"\x00")
    print(f'en ({s}) -> {ret} | {parts}')
    return ret

d = {
    '1999': [['19990012', '动画片']],
    '1114': [['11140004', '动画片'], ['11140011', '冒险']],
    '1101': [['11010020', '冒险']]
}

sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(en(main_type))
    with sub_type_env.begin(write=True, db=db) as sub_type2id_txn:
        for item in items:
            tag_id, tag_name = item
            key = tag_name.encode()
            val = tag_id.encode()
            sub_type2id_txn.put(key, val)
        print('{} -> {} -> {}'.format(main_type, len(items), sub_type2id_txn.stat()['entries']))

sub_type_env.close()

# iterate
sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(en(main_type))
    with sub_type_env.begin(write=False, db=db) as sub_type2id_txn:
        for i, j in sub_type2id_txn.cursor():
            print('{}->{}->{}'.format(main_type, i.decode(), j.decode()))

sub_type_env.close()
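
To see the suspected truncation without touching LMDB at all, here is a minimal sketch using only the standard library. The helper `name_seen_by_strlen` is hypothetical (it is not part of py-lmdb); it just cuts a byte string at the first \x00, which is an assumption about how the name ends up being measured on the C side.

```python
import pickle


def name_seen_by_strlen(name: bytes) -> bytes:
    """Return the part of `name` before the first NUL byte.

    Hypothetical helper: it mimics how a C strlen() call would measure
    the buffer, which is an assumption about what mdb.c does with it.
    """
    return name.split(b"\x00", 1)[0]


str_keys = ["1999", "1114", "1101"]
small_int_keys = [1, 2, 3]

# Pickled string names differ as full byte strings, but only in the
# bytes that come after the first NUL.
pickled_prefixes = {name_seen_by_strlen(pickle.dumps(k)) for k in str_keys}

# Plain UTF-8 names contain no NUL byte, so nothing is cut off.
encoded_prefixes = {name_seen_by_strlen(k.encode()) for k in str_keys}

# Pickles of small integers happen to contain no NUL byte either.
small_int_prefixes = {
    name_seen_by_strlen(pickle.dumps(k)) for k in small_int_keys
}

# Number of distinct names left after the simulated truncation.
print("pickled str keys:", len(pickled_prefixes))
print("encoded str keys:", len(encoded_prefixes))
print("pickled small ints:", len(small_int_prefixes))
```

If the assumption holds, this would also explain the earlier observations: the pickled names for '1999', '1114' and '1101' collapse to one name and therefore one database, while the pickled names for 1, 2 and 3 stay distinct.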

vEpiphyte, Apr 04 '22 15:04