grok-1 segfault after having found the 8 GPUs

Open tommasoboccali opened this issue 1 year ago • 4 comments

Dear all, I managed to set up a DGX-1 (8x A100) for grok-1, though getting the environment right took a lot of effort.

Everything seems fine now, but at some point it just segfaults. Any ideas?

Singularity> python run.py
INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO:rank:Initializing mesh for self.local_mesh_config=(1, 8) self.between_hosts_config=(1, 1)...
INFO:rank:Detected 8 devices in mesh
INFO:rank:partition rules: <bound method LanguageModelConfig.partition_rules of LanguageModelConfig(model=TransformerConfig(emb_size=6144, key_size=128, num_q_heads=48, num_kv_heads=8, num_layers=64, vocab_size=131072, widening_factor=8, attn_output_multiplier=0.08838834764831845, name=None, num_experts=8, capacity_factor=1.0, num_selected_experts=2, init_scale=1.0, shard_activations=True, data_axis='data', model_axis='model'), vocab_size=131072, pad_token=0, eos_token=2, sequence_len=8192, model_size=6144, embedding_init_scale=1.0, embedding_multiplier_scale=78.38367176906169, output_multiplier_scale=0.5773502691896257, name=None, fprop_dtype=<class 'jax.numpy.bfloat16'>, model_type=None, init_scale_override=None, shard_embeddings=True)>
Segmentation fault (core dumped)
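
The log above stops at "Segmentation fault" without saying which Python call was running. One way to narrow that down, sketched here as a suggestion and not something tried in this thread, is Python's standard `faulthandler` module, which dumps the Python stack when the process receives a fatal signal. The helper name `enable_crash_traceback` and the log file name are hypothetical, chosen only for illustration.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path="segfault_traceback.log"):
    """Write the Python stack of all threads to log_path on a fatal signal.

    Hypothetical helper, not part of grok-1. The returned file object must
    stay open for the lifetime of the process, because faulthandler writes
    to its file descriptor at crash time.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback()
    print(f"faulthandler enabled: {faulthandler.is_enabled()}", file=sys.stderr)
    # From here one would import and run the model code (for example, the
    # contents of run.py), so that a segfault leaves a traceback in the log.
```

The same effect is available without editing any code by running `python -X faulthandler run.py`, which prints the traceback to stderr instead of a file.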

Mar 22 '24 22:03 tommasoboccali

I've seen that error before, and it was cured by trying other versions of numpy, believe it or not.
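
When comparing numpy versions as suggested above, it helps to record exactly which versions are installed before and after each change. A minimal sketch using only the standard library follows; the package list (`numpy`, `jax`, `jaxlib`) is an assumption about what matters for this crash, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
import platform

# Assumed to be the packages relevant here; extend the tuple as needed.
PACKAGES = ("numpy", "jax", "jaxlib")


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"python {platform.python_version()}")
    for name, version in installed_versions().items():
        print(f"{name} {version or 'not installed'}")
```

Running this inside the same Singularity container as `run.py` and pasting the output into the issue makes it easier for others to reproduce the combination that segfaults.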

Mar 25 '24 11:03 Sequential-circuits

No fix here. The problem is within numpy; escalate it with the numpy project.

Mar 27 '24 04:03 Aareon

It's trying to load AMD GPU drivers on your Nvidia GPU platform, which obviously doesn't work (the question is why it's trying to load AMD drivers).

Apr 16 '24 14:04 divinity76
