Lifann issues

Results 12 issues of Lifann

fix: Select mpi_lib loading mode by default to adapt to different platforms

## Checklist before submitting
- [x] Did you read the [contributor guide](https://github.com/horovod/horovod/blob/master/CONTRIBUTING.md)?
- [ ] Did you update the docs?
- [ ] Did you write any tests...

wontfix

HorovodBasics loading the dynamic library makes gRPC channel creation fail with tensorflow-2.11

**Environment:**
1. Framework: (TensorFlow, Keras, PyTorch, MXNet) TensorFlow
2. Framework version: tensorflow-2.11
3. Horovod version: horovod-2.28.1
4. MPI version: openmpi-4.1.2a1-1.54103.x86_64
5. CUDA version: cuda-11.2
6. NCCL version: nccl-2.18
7. Python...

bug
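The two Horovod entries above both concern how a shared library is loaded from Python. As a rough illustration only, and not the code from either entry, the sketch below assumes the library is opened through `ctypes` and shows one way a loading mode could be selected. The environment variable name `EXAMPLE_LIB_LOAD_MODE`, the helper names, and the choice of `RTLD_LOCAL` as the default are all hypothetical. The general point is that with `RTLD_GLOBAL` a library's symbols become available for resolving symbols in libraries loaded afterwards, so the mode can affect whether unrelated native libraries interfere with each other.

```python
import ctypes
import os

# Hypothetical environment variable name, used only for this sketch.
MODE_ENV_VAR = "EXAMPLE_LIB_LOAD_MODE"


def select_load_mode(env=None):
    """Return the dlopen mode flag to pass to ctypes.CDLL.

    Assumption for this sketch: default to RTLD_LOCAL so the library's
    symbols are not exported to libraries loaded later; RTLD_GLOBAL has
    to be requested explicitly.
    """
    env = os.environ if env is None else env
    requested = env.get(MODE_ENV_VAR, "local").strip().lower()
    if requested == "global":
        return ctypes.RTLD_GLOBAL
    if requested == "local":
        return ctypes.RTLD_LOCAL
    raise ValueError(f"unknown load mode: {requested!r}")


def load_library(path, env=None):
    """Load a shared library with the selected symbol visibility."""
    return ctypes.CDLL(path, mode=select_load_mode(env))
```

A caller would then use `load_library("/path/to/lib.so")` and switch modes per platform through the environment, without touching the loading code itself.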

[Question] Is there a pipeline mechanism to help lookup requests always be handled in the device cache in HugeCTR?

### Background In recommender system training, the user/item/history features can be very large in production. Treated as a multi-level cache, HPS can store large sparse parameters well, with...

question

fix: Do not reuse variable in python to avoid conflict of multiple variables with different properties

# Description Brief Description of the PR: When multiple variables share the same name with different properties in graph mode, reusing `Variable` in python makes only the...

Hkv code draft

Draft code to integrate hkv into tfra.

[Feat]Copy-free save and load for cuckoo hashtable

# Description Brief Description of the PR: Since dynamic embeddings can be too large for the memory limit, saving and loading with the traditional TensorFlow checkpoint mechanism will use a lot of...

opt(insert-and-evict): thrust prefix_sum introduce cudaMalloc/cudaFre…

opt(insert-and-evict): thrust prefix_sum introduces `cudaMalloc` and `cudaFree`, which cause device synchronization. Replace it with the cub API. The output of the unit test case `insert-and-evict` is as follows: [ut_output.txt](https://github.com/NVIDIA-Merlin/HierarchicalKV/files/13648655/ut_output.txt)
