Kunlun Li

Results 14 comments of Kunlun Li

Has the NCCL library been installed in /usr/local/cuda-11.2/... ? If yes, could you try setting the environment variable `NCCL_DIR=/usr/local/cuda-11.2/`?

@silpara Could you give the specific installation locations of `nccl.h` and `libnccl.so`? For example, in the `nvcr.io/nvidia/tensorflow:22.05-tf2-py3` container:

* `nccl.h` is at `/usr/include/nccl.h`
* `libnccl.so` is at `/usr/lib/x86_64-linux-gnu/libnccl.so`

The following...
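One way to answer the question about where `nccl.h` and `libnccl.so` live is to search a few candidate directories. The snippet below is only a minimal sketch: `find_nccl_files` is a hypothetical helper (not part of SOK or NCCL), and the root directories in the example call are guesses based on the paths mentioned in these comments.

```python
import os


def find_nccl_files(roots, names=("nccl.h", "libnccl.so")):
    """Return the first path found for each file name under the given roots.

    Hypothetical helper for illustration; a name maps to None if not found.
    """
    found = {name: None for name in names}
    for root in roots:
        # os.walk yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in names:
                if found[name] is None and name in filenames:
                    found[name] = os.path.join(dirpath, name)
            if all(found.values()):
                return found
    return found


if __name__ == "__main__":
    # Candidate roots are assumptions; adjust them to your system.
    candidates = ["/usr/include", "/usr/lib", "/usr/local/cuda-11.2"]
    for name, path in find_nccl_files(candidates).items():
        print(f"{name}: {path or 'not found'}")
```

If either file is reported as not found, that would suggest NCCL is not installed in the searched locations, or is installed somewhere that needs to be passed via `NCCL_DIR`.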

It looks like CMake can't find the TensorFlow library now. SOK uses these commands to locate it: ```bash python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))" python...

> I also encountered the same mistake above, please ask how to solve it now?

Is it NCCL or TensorFlow that CMake cannot find? @zongshibuzai

Could you run these commands and share what they print? ```python python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))" python -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))" python -c "import...

Do the error logs also look like this: ``` ... CMake Error at cmakes/FindTensorFlow.cmake:23 (string): string sub-command REPLACE requires at least four arguments. ... CMake Error at cmakes/FindTensorFlow.cmake:30 (string): string sub-command...

I can see why CMake fails: we have CMake execute a subprocess ([script here](https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/sparse_operation_kit/cmakes/FindTensorFlow.cmake#L6)) like `python -c "import tensorflow as tf; print(tf.version)"` to get the location of TensorFlow, but...
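To check by hand whether such a subprocess query returns anything usable, the same kind of command can be run from Python and its output inspected. This is a rough sketch rather than what `FindTensorFlow.cmake` does: `run_python_snippet` is a hypothetical helper, and the link between an empty result and the `string sub-command REPLACE` error is an assumption based on the error logs quoted in these comments.

```python
import subprocess
import sys


def run_python_snippet(code, python=sys.executable):
    """Run `python -c code` and report whether it produced usable output.

    Hypothetical helper for illustration. Returns (ok, stdout, stderr), where
    ok is True only if the process exited with 0 and printed something.
    """
    result = subprocess.run(
        [python, "-c", code], capture_output=True, text=True
    )
    stdout = result.stdout.strip()
    ok = result.returncode == 0 and stdout != ""
    return ok, stdout, result.stderr.strip()


if __name__ == "__main__":
    query = (
        "import tensorflow as tf; "
        "print(' '.join(tf.sysconfig.get_compile_flags()))"
    )
    ok, out, err = run_python_snippet(query)
    if ok:
        print("compile flags:", out)
    else:
        # An empty result here would likely leave CMake with an empty
        # variable to pass on to its later string operations.
        print("query failed or printed nothing:", err)
```

Running this with the same interpreter that CMake picks up can show whether the failure comes from the Python environment (for example, TensorFlow not importable) rather than from CMake itself.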

The corresponding MCore MR was merged.

I also feel it is a very ugly approach, but I can't think of a better way to do it. I then asked @timmoon10 whether he had any insight, I...

> Ok, so let's maybe do this:
> * create the option in fp8_model_init to preserve_high_precision_initialization which would be the trigger to save the copy on the CPU. We should...