
IndexError: `InlinedVector::at(size_type) const` failed bounds check

Open • TonyTangYu opened this issue 2 years ago • 1 comment

Please describe the bug

Hello Alpa team, I ran the benchmark on my system with 8 GPUs. When I run the command `python benchmark --suite gpt.grid_search_auto`, I hit the error shown in the screenshot. According to the printed output, the error occurs while compiling all stages during profiling for submesh (1, 4). Profiling for submesh (1, 8) completes without errors.

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu 16.04 with 8 GPUs.
  • Python version: 3.7.12
  • CUDA version: cuda 11.1
  • NCCL version: 2.8.4
  • cupy version: cupy-cuda111 11.0.0
  • GPU model and memory: A100 80GB
  • Alpa version: 1.0.0.dev0
  • TensorFlow version: 2.9.1
  • JAX version: 0.3.5

To Reproduce

Steps to reproduce the behavior:

  1. python gen_prof_database.py --max-comm-size-intra-node 32 --max-comm-size-inter-node 29
  2. python benchmark --suite gpt.grid_search_auto
  3. See error

Screenshots

[Image: 截屏2022-11-26 19 14 23 (screenshot of the error)]

Could you please help me resolve this? Thanks a lot.

TonyTangYu • Nov 27 '22 01:11

Have you solved this? I am running into the same error.

caixiiaoyang • Sep 20 '23 10:09