lpython icon indicating copy to clipboard operation
lpython copied to clipboard

Dict implementation design

Open certik opened this issue 2 years ago • 13 comments

Here are the freedoms that we have in implementation dictionaries:

  • Hash function
  • load factor / initial size of an array
  • array resizing strategy (rehashing or not)
  • key collision strategy
  • memory allocation
  • ordered vs unordered
  • hash table vs red-black tree

Here is how we can do this:

  • [ ] We start with some defaults (just one choice for all of the above)
  • [ ] We implement more (eventually all) choices, and allow the user to select using a type annotation, such as x: Dict[hash3, loadfactor(5), resizing("x"), collision_strategy("list")].
  • [ ] We can then benchmark various combinations and choose some good solid defaults
  • [ ] We can later add a user defined hash function that gets compiled and inlined in LLVM, so optimal performance.
  • [ ] Ideally the hash function should be defined in the frontend language too

Note: #975 implements unordered dict. The ordered dict implementation can be implemented on top by adding some data structure to maintain the order of key insertion nodes (such as linked list), and another hash table of key -> node, so that you can do insertion, lookup and deletion all in O(1); it seems lookup is as fast as for unordered dict, and so is unordered iteration; but insertion and deletion is slower due to updating of the two extra datastructures; and there is a new feature "ordered iteration" (that did not exist in unordered dict, but it is slower than "unordered iteration"); more details at https://github.com/Aqcurate/OrderedDict#ordered-dictionary-implementation.

certik avatar Aug 17 '22 13:08 certik

Adding TODOs here after #975 is merged (in decreasing priority order),

  • [x] Create iterators using builder->CreateAlloca for LLVMList and LLVMDict outside user loops. In case we enter a loop in LLVM backend then let the list_api, dict_api so that they can set their iterators before hand - https://github.com/lcompilers/lpython/pull/986
  • [x] Benchmarks dict right after the above issue is resolved - https://gist.github.com/czgdp1807/e7f7b6ca52c57b16b27ec8d0259c6d4a
  • [x] Implement separate chaining key collision resolution using arrays of arrays (basically wrapped under lists so that we can re-use a lot of code already written for lists). - https://github.com/lcompilers/lpython/pull/990 (uses only one benefit of separate chaining, separate chaining in original form has to be implemented because it provides a lot of benefits other than the one of avoiding probing for keys which haven’t collided with any other), https://github.com/lcompilers/lpython/pull/1066
  • [x] Implement, delete a.k.a erase key feature in dict for all the collision resolution strategies. - https://github.com/lcompilers/lpython/pull/1011, https://github.com/lcompilers/lpython/pull/1038
  • [x] Try different hash functions for different key types and set default after benchmarking - https://github.com/lcompilers/lpython/pull/1037, https://github.com/lcompilers/lpython/pull/1066
  • [ ] Support nested dictionaries, tuples as keys, list as values
  • [ ] Support passing dict as function arguments and return value of functions
  • [ ] Port LLVM implementation to C++ to see if the algorithm is capable to beat C++ std::unordered_map.
  • [ ] Implement lazy re-hashing. Very tricky to implement but still worth trying out.
  • [ ] Compare lazy re-hashing and re-hashing all at once.
  • [ ] Try abstracting out the common parts of LLVMList and LLVMDict APIs - Not so important but if time permits can be tried.
  • [ ] Exposing load_factor to the user so that they can set it according to their needs. Similarly for hash functions.

czgdp1807 avatar Aug 17 '22 15:08 czgdp1807

Question - Should we allow floats/reals as keys?

The reason why I am asking this question is because floats are always approximately equal. So, converting them to int of same size by memcpy won't work well. Plus, its not a good practice to use floats as keys. 123.000 and 123.000001 are approximately same, now user may expect them to work as a single key but due to slight difference both will give totally different hash value. So disallowing these as keys will prevent a lot of bad practices.

Also, in case of collisions we will go for exact equality checks in keys but as I said above floats are approximately equal mostly, so users may get very random results due to this property of floats.

Even if we want to allow floats as keys, we should ask the user to provide an equality function in the dict type which we should use to compare floats in case of collisions.

See https://softwareengineering.stackexchange.com/a/391107 and https://diego.assencio.com/?index=67e5393c40a627818513f9bcacd6a70d for details.

czgdp1807 avatar Aug 24 '22 10:08 czgdp1807

For now I would not. Later once we allow a use defined hash, we can, but this is rarely used in Python anyway, so for now let's just give a compiler error.

certik avatar Aug 24 '22 13:08 certik

Hey everyone. I want to work on development and enhancements of Data Structures for LPython. Can anyone suggest a good-first-feature associated with this? Specifically, I want to explore the LLVM part of the compiler. @czgdp1807 @certik

#983 seems to be dormant from some time. So, anything which I can help with?

harshsingh-24 avatar Mar 16 '23 08:03 harshsingh-24

Support passing dict as function arguments and return value of functions

This one may be.

czgdp1807 avatar Mar 16 '23 08:03 czgdp1807

For Support of Passing Dictionary as Function Arguments and Return Value of Functions, I realized a few things which I would like to jot down. I will break the problem statement into two parts which will be independent of each other.

  • Allowing dict as function argument
def addDict(a: dict[i32, i32]):
    print(a[1])

print(addDict({1: 100, 2:400}))

For the above code, when I sequentially ran LPython on it with flags --show-tokens, show-ast and show-asr, it fails on show-asr which means there is a problem from AST to ASR conversion. Therefore, need to write logic for that first.

  • Allowing dict as function return
def addPair(a: i32, b: i32) -> dict[i32, i32]:
    return {a: b}

x : dict[i32, i32]
x = addPair(2, 1000)

print(x[2])

For the above code, when I again ran sequentially with the above mentioned flags it fails on show-llvm tag with a code generation error, which implies that I need to modify asr_to_llvm.cpp for IR generation of Dictionary Return.

@czgdp1807 Sounds good? If possible, can you point me to the associated files with above Issue?

harshsingh-24 avatar Mar 16 '23 11:03 harshsingh-24

Yes. Makes sense to me.

czgdp1807 avatar Mar 16 '23 11:03 czgdp1807

If possible, can you point me to the associated files with above Issue?

This issue is to be handled in asr_to_llvm.cpp. If the ASR doesn't seem to be correct, then we have to redesign the handling of ASR in python_ast_to_asr.cpp

It seems the Dict return type is not implemented in your case: https://github.com/lcompilers/lpython/blob/c7927193ce644aee87b7b5171465962664b058cc/src/libasr/codegen/asr_to_llvm.cpp#L3756

Thirumalai-Shaktivel avatar Mar 16 '23 16:03 Thirumalai-Shaktivel

Hey. I was able to figure out a few things as to how to implement the solution of Return Type at least. First of all, there are two apis that have been exposed by LLVM to us - dict_api_lp (Linear Probing), dict_api_sc (Separate Chaining). set_dict_api is used for setting which API to use depending on what kind of objects have been passed.

In the above mentioned place by @Thirumalai-Shaktivel, I added the following logic for a return-type of Dictionary.

case (ASR::ttypeType::Dict) : {
                    ASR::Dict_t* asr_dict = ASR::down_cast<ASR::Dict_t>(return_var_type0);
                    std::string key_type_code = ASRUtils::get_type_code(asr_dict->m_key_type);
                    std::string value_type_code = ASRUtils::get_type_code(asr_dict->m_value_type);
                    
                    std ::cout << "DICT DEBUG: " << key_type_code << " " << value_type_code << std :: endl;
                    
                    bool is_local_array_type = false, is_local_malloc_array_type = false;
                    bool is_local_list = false;
                    ASR::dimension_t* local_m_dims = nullptr;
                    ASR::storage_typeType local_m_storage = ASR::storage_typeType::Default;
                    int local_n_dims = 0, local_a_kind = -1;

                    llvm::Type* key_llvm_type = get_type_from_ttype_t(asr_dict->m_key_type, local_m_storage,is_local_array_type, is_local_malloc_array_type,is_local_list, local_m_dims, local_n_dims,local_a_kind);
                    llvm::Type* value_llvm_type = get_type_from_ttype_t(asr_dict->m_value_type, local_m_storage,is_local_array_type, is_local_malloc_array_type,is_local_list, local_m_dims, local_n_dims,local_a_kind);
                    int32_t key_type_size = get_type_size(asr_dict->m_key_type, key_llvm_type, local_a_kind);
                    int32_t value_type_size = get_type_size(asr_dict->m_value_type, value_llvm_type, local_a_kind);

                    set_dict_api(asr_dict);

                    return_type = llvm_utils->dict_api->get_dict_type(key_type_code, value_type_code, key_type_size,value_type_size, key_llvm_type, value_llvm_type);

                    // throw CodeGenError("DICT RETURN NOT IMPLEMENTED YET");
                    break;
                }

@certik @czgdp1807 After this, what else needs to be changed? I guess visit_Variable function also needs to changed, because it throws an error - throw CodeGenError("Variable type not supported " + std::to_string(x.m_type->type), x.base.base.loc);? We might need to update llvm_symtab?

harshsingh-24 avatar Mar 18 '23 15:03 harshsingh-24

It seems like you have provided a description of the different freedoms we have in implementing dictionaries and some suggestions on how to approach implementing them.

There are different choices that can be made in implementing dictionaries, such as the hash function, load factor, resizing strategy, collision strategy, memory allocation, ordered vs unordered, and hash table vs red-black tree.

A good starting point is to choose default options for all of the above choices.

To allow users to select their own options, a type annotation can be used, such as x: Dict[hash3, loadfactor(5), resizing("x"), collision_strategy("list")].

Benchmarking can be used to choose good solid defaults.

A user-defined hash function can be added later for optimal performance.

Ideally, the hash function should be defined in the frontend language as well.

Ordered dictionaries can be implemented on top of unordered dictionaries by maintaining a data structure to track the order of key insertion nodes and another hash table of key -> node, but this may result in slower insertion and deletion due to the additional data structures. Overall, it seems like you have provided a good starting point for designing and implementing dictionaries with different options and flexibility.

rsaba5612 avatar Mar 29 '23 20:03 rsaba5612

Hey!! Can you give an example of Support nested dictionaries, tuples as keys, list as values which I can initially work on and then incrementally I will add support for each such type. Basically, my doubt is in what order should I go implementing the above feature? @czgdp1807 @certik

I feel that I should go with nested dicts as value support first and then should try for other data types? As per my knowledge, A Dictionary keys of a Dictionary must be unique and of immutable data type such as Strings, Integers, Floats, tuples (Hashable Types) but the values can be repeated and be of any type.

harshsingh-24 avatar Apr 07 '23 18:04 harshsingh-24

@czgdp1807 I have completed the task Support passing dict as function arguments and return value of functions in the above checklist. What should I work on next? @certik

harshsingh-24 avatar Apr 09 '23 15:04 harshsingh-24

Hello @czgdp1807 is this project still available for GSoC or is this completed?

aryangupta701 avatar Mar 23 '24 18:03 aryangupta701