Abulhair Saparov comments


Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

Not really, but I think the HuggingFace folks are trying to work around the issue, since it seems to be affecting a bunch of other people. See: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/318 But they did...

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

@bri25yu I got it working after pulling some newer code from a branch. See: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/318#issuecomment-1195958248

Some questions

Thank you for the suggestion. Are you trying to reproduce the experiments in the paper? Or are you trying to run your own RL code using the JBW? You don't...

Some questions

Our evaluation was quite simple, and we describe it in our paper: we just plot the "reward rate" over time, where the reward rate is defined as the total reward...

Some questions

In our experiments, the reward depends on the experiment (i.e. the task). For example, if the task is Collect[Jellybean], the agent receives +1 reward whenever it collects a jellybean item....

Some questions

Yes, the Swift experiments require Swift for TensorFlow. The README states under both the [Requirements](https://github.com/eaplatanios/jelly-bean-world#requirements) and [Using Swift](https://github.com/eaplatanios/jelly-bean-world#using-swift) sections that you need Swift for TensorFlow 0.8. We...

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

@stas00

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

@stas00 It seems to be working with `CUDA_LAUNCH_BLOCKING=1`! I'll test with `bigscience/bloom-1b3` next.
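
For readers unfamiliar with the setting: a minimal sketch, not taken from the thread, of one way to enable `CUDA_LAUNCH_BLOCKING` from inside a Python script. It assumes the variable is set before CUDA is initialized (safest is before importing `torch`). The helper name `enable_blocking_cuda_launches` is hypothetical.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Hypothetical helper: request synchronous CUDA kernel launches.

    With CUDA_LAUNCH_BLOCKING=1, kernel launches block until they finish,
    so an error is reported at the call that caused it rather than later.
    """
    # Assumption: this runs before CUDA is initialized in the process.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the script, before any CUDA work starts.
enable_blocking_cuda_launches()
```

This setting slows execution because launches become synchronous, so it is usually only left on while debugging.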

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

@stas00 Actually, I just tested both `bigscience/bloom` and `bigscience/bloom-1b3` without `CUDA_LAUNCH_BLOCKING=1`, and they both work. This is probably because I pulled newer code from the `bloom-inference` branch of this repo...

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

@pai4451 I didn't change any code from this repo at all. I followed the installation instructions in the [readme](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/README.md). I invoke the inference script using: `deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom`...